PARYA DOLATYABI
Department of Computer Science, University of Tulsa, Tulsa, OK 74104, USA
Corresponding author: Parya Dolatyabi (pad7492@utulsa.edu)
This work was supported in part by U.S. Department of Transportation (USDOT) under Grant 693JJ32350030.
INDEX TERMS Deep learning, traffic scene understanding, discriminative models, generative models, domain adaptation, classification, object detection, segmentation.
The rapid evolution of deep learning (DL), particularly in computer vision, has initiated a new era of intelligent transportation systems. Researchers have made significant strides in advancing autonomous vehicles, traffic management, and pedestrian safety by fusing deep neural architectures with the complexities of traffic scenes. However, despite these advancements, critical challenges persist in effectively translating theoretical breakthroughs into robust, real-world applications, such as handling the variability of traffic environments, ensuring real-time processing, and achieving high accuracy under diverse conditions. This review aims to thoroughly explore these challenges by offering an in-depth analysis of the complex interactions between deep neural networks (DNNs), computer vision, and traffic scene understanding.
Previous studies have made substantial contributions to this field, but they also exhibit certain limitations. For example, [1] provided an extensive survey of deep learning-based object detection in traffic scenarios, covering over 100 papers and highlighting challenges such as real-time performance, image quality degradation, and object occlusion. Autonomous driving technologies were investigated in [2], with a focus on DL methods for perception, mapping, and sensor fusion; the authors also pointed out limitations in multi-sensor integration and prediction accuracy. The authors of [3] covered deep learning techniques for object detection, semantic segmentation, instance segmentation, and lane line segmentation in autonomous driving. They highlighted key challenges such as high computational cost, real-time performance limitations, and occlusion issues, particularly in instance segmentation, where region proposal-based methods often struggle with small or occluded objects. Advanced methods, such as Adaptive Feature (AF) pooling, were suggested to improve efficiency in these scenarios. The study in [4] reviews methods based on artificial intelligence (AI), including convolutional neural networks (CNNs) and reinforcement learning, for tasks such as driving scene perception, path planning, and motion control. It discusses challenges including handling occlusion, particularly during scene perception, where occluded objects often hinder accurate detection and recognition. Despite their valuable insights, these studies face several critical shortcomings:
The associate editor coordinating the review of this manuscript and approving it for publication was Turgay Celik.
To address these gaps, our paper studies the core computer vision techniques of classification, object detection, and segmentation, while also extending its analysis to cover advanced topics including action recognition, object tracking, path prediction, anomaly detection, scene generation, and image enhancement. By synthesizing findings from a broad spectrum of studies, our paper provides a holistic overview of the evolution from traditional image processing methods to advanced DL models, including Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), and Domain Adaptation models. It emphasizes the integration of these models into real-world applications such as autonomous driving, traffic management, and pedestrian safety, while also addressing challenges including occlusions, dynamic urban traffic environments, and varying weather and lighting conditions. The contributions of our paper are:
These contributions make our paper a more thorough and forward-thinking resource, offering a critical assessment of current research, highlighting limitations, and presenting new perspectives on traffic scene understanding. By addressing the shortcomings of previous studies and offering a clear articulation of current challenges faced, this review aims to inspire future research and development efforts that drive innovation in deep learning for traffic scene understanding.
The rest of this paper is organized as follows. In Section II, we introduce the discriminative DL models, including CNNs, region-based CNN (R-CNN) variants, YOLO, Vision Transformers (ViTs), Detection Transformers (DETRs), graph-based models, and capsule networks (CapsNets). Section III focuses on generative machine learning (ML) models, encompassing GANs, conditional GANs (cGANs), and variational autoencoders (VAEs). In Section IV, we explore domain adaptation (DA) models within the categories of clustering-based, discrepancy-based, and adversarial-based approaches. A comparative discussion of these models is provided in Section V. Hyperparameter optimization (HPO) techniques are detailed within each category. Finally, future research areas and concluding remarks are presented in Sections VI and VII, respectively.
In this section, we focus on the most popular datasets identified in the papers reviewed in our study. These datasets are widely used across various tasks in traffic scene understanding, including object detection, segmentation, 3D tracking, classification, and domain adaptation. They have been selected based on their frequency of citation, versatility, and relevance to core applications. For less popular, niche, or highly customized datasets, readers are referred to the corresponding references cited in the respective works.
Table 1 summarizes 11 widely used datasets in traffic scene understanding, categorized based on their applications and characteristics. COCO 2017, VOC2007, and Cityscapes are benchmarks for object detection and segmentation, offering extensive annotations for diverse object categories and urban scenes. KITTI and nuScenes focus on 3D object detection and multi-object tracking, with KITTI emphasizing structured environments and nuScenes extending to radar data and more dynamic scenarios. GTSRB specializes in traffic sign recognition, providing a targeted dataset for autonomous driving systems. 1043-syn is a synthetic dataset optimized for traffic scene classification, particularly under controlled lighting and object variation scenarios. For person re-identification, DukeMTMC-ReID is a key benchmark, supporting identity matching tasks across multi-camera setups.
These datasets vary in scale and context, with real-world datasets like BDD, Mapillary, and Cityscapes capturing diverse weather and geographic conditions, while synthetic datasets like SYNTHIA and 1043-syn simulate controlled scenarios for domain adaptation and classification. Large-scale datasets such as COCO 2017 and BDD provide extensive data for deep learning, whereas smaller datasets like VOC2007 and KITTI offer high-quality annotations for specific tasks. This combination of real-world variability, geographic diversity, and synthetic precision allows researchers to address multifaceted challenges in traffic scene understanding, leveraging the strengths of each dataset for robust model development.
Discriminative DL models, often based on CNNs, are crucial for understanding complex traffic scenes. They excel at distinguishing objects and patterns, enabling tasks like object detection, classification, and segmentation. In traffic contexts, these models accurately identify vehicles, pedestrians, and road signs, enhancing real-time analysis in video feeds. By leveraging discriminative DL, systems improve road safety and efficiency, assisting autonomous navigation, traffic flow analysis, and pedestrian behavior prediction. This advancement supports intelligent transportation systems and enhances overall road safety.
In the following sections, we examine various discriminative DL models, including R-CNNs, YOLO variants, and attention-based architectures, tailored for traffic scene understanding. These models address tasks like object detection, semantic segmentation, and action recognition, shaping intelligent transportation systems. We also discuss HPO for these architectures and compare performance metrics, providing a comprehensive overview.
A CNN [5] is a DL model designed for grid-like data (e.g., images), commonly used in traffic scene understanding to analyze camera images. CNNs automatically extract features like objects, signs, and road markings, supporting real-time processing and enhancing road safety for autonomous vehicles.
A basic CNN for image classification comprises convolution, pooling, fully connected (FC) layers, and an output stage. Convolution applies filters to generate feature maps, which are pooled and flattened before passing through the FC layers, with a softmax layer producing class probabilities.
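To make the pooling stage concrete, here is a minimal pure-Python sketch of non-overlapping 2x2 max pooling, which halves each spatial dimension while keeping the strongest activation in every window (illustrative values, not a framework implementation):

```python
def max_pool2x2(fmap):
    """Non-overlapping 2x2 max pooling: halves each spatial dimension
    while keeping the strongest activation in every window."""
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 6],
        [2, 8, 3, 4]]
pooled = max_pool2x2(fmap)  # -> [[4, 5], [8, 7]]
```

Production CNNs apply the same reduction per channel over large batches; the downsampling also grants a degree of translation invariance.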
The core CNN operation is convolution:
\[ S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K(m, n), \]
where \(I\) is the input image, \(K\) is a learned filter (kernel), and \(S\) is the resulting feature map; each output value summarizes a local neighborhood of the input.
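A minimal pure-Python sketch of this operation (valid mode, no padding or stride, single channel; deep learning frameworks implement the same sliding-window sum, vectorized and batched):

```python
def conv2d(image, kernel):
    """Valid-mode 2D convolution (cross-correlation, as in CNN layers)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            s = sum(image[i + m][j + n] * kernel[m][n]
                    for m in range(kh) for n in range(kw))
            row.append(s)
        out.append(row)
    return out

# A 2x2 kernel applied to a 4x4 image yields a 3x3 feature map.
img = [[1, 2, 0, 1],
       [0, 1, 3, 1],
       [2, 1, 0, 0],
       [1, 0, 1, 2]]
k = [[1, 0],
     [0, -1]]
feature_map = conv2d(img, k)
```

Note that, like most DL frameworks, this computes cross-correlation (no kernel flip); the learned weights absorb the difference.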
TABLE 1. A summary of the most popular datasets identified in the papers reviewed in our study, categorized based on their characteristics and key features. Approximate numbers are used for dataset sizes to account for variations across versions, releases, or documentation. These datasets are widely adopted for diverse tasks such as object detection, segmentation, 3D tracking, classification, and domain adaptation, with applications spanning real-world scenarios and synthetic simulations. The inclusion of train and test sizes, along with geographic or virtual origins, highlights the diversity and specificity of these datasets in advancing traffic scene understanding. The term Varies in the Image Size column indicates datasets with images of multiple resolutions. Trainval refers to a combined set of training and validation images. Fine and Coarse denote levels of annotation granularity, with fine being pixel-accurate and coarse being approximate or less detailed.
| Dataset | Full Name | Application | Train Size | Test Size | Image Size (pixels) | Location |
| COCO 2017 | Common Objects in Context | Object detection, segmentation, and image captioning. | 118,000 | 41,000 | Varies (\(640 \times 480\) to \(2048 \times 1024\)) | Global |
| KITTI | Karlsruhe Institute of Technology and Toyota Technological Institute | 3D object detection, multi-object tracking. | 7,481 | 7,518 | \(1242 \times 375\) | Karlsruhe, Germany |
| GTSRB | German Traffic Sign Recognition Benchmark | Traffic sign classification and recognition. | 39,209 | 12,630 | Varies (\(15 \times 15\) to \(250 \times 250\)) | Germany |
| VOC2007 | PASCAL Visual Object Classes 2007 | Object detection and classification. | 5,011 (trainval) | 4,952 | Varies (\(500 \times 375\)) | Europe |
| Cityscapes | Cityscapes | Semantic segmentation of urban scenes. | 3,475 (fine), 20,000 (coarse) | 1,525 (fine) | \(2048 \times 1024\) | Multiple cities in Germany |
| nuScenes | nuScenes | 3D object detection, multi-sensor tracking. | 28,130 | 6,008 | \(1600 \times 900\) | Boston, USA; Singapore |
| BDD | Berkeley DeepDrive Dataset | Object detection, segmentation, classification. | 70,000 | 20,000 | \(1280 \times 720\) | USA |
| Mapillary | Mapillary Dataset | Street-level semantic segmentation. | 20,000 | 5,000 | Varies | Global |
| SYNTHIA | Synthetic Images for Training | Synthetic data for segmentation, domain adaptation. | 8,000 | 1,400 | \(960 \times 720\) | Virtual (synthetic) |
| DukeMTMC-ReID | Duke Multi-Target Multi-Camera Re-ID Dataset | Person re-identification. | 16,522 | 19,889 | \(128 \times 64\) | Duke University, USA |
| 1043-syn | 1043 Synthetic Dataset | Synthetic dataset for classification, object recognition. | 8,000 | 2,000 | \(640 \times 480\) | Virtual (synthetic) |
Next, a ReLU activation function introduces non-linearity:
\[ f(x) = \max(0, x), \]
where negative activations are set to zero while positive values pass through unchanged, allowing the network to model non-linear decision boundaries.
The FC layer performs classification, with output
\[ \mathbf{y} = \operatorname{softmax}(W\mathbf{x} + \mathbf{b}), \]
where \(\mathbf{x}\) is the flattened feature vector, \(W\) and \(\mathbf{b}\) are the layer's weight matrix and bias, and the softmax converts the resulting logits into class probabilities.
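The classification head described above can be sketched end to end in pure Python; the weights below are illustrative placeholders rather than trained values:

```python
import math

def relu(v):
    """Zero out negative activations, pass positives unchanged."""
    return [max(0.0, x) for x in v]

def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fc_layer(x, W, b):
    """y = Wx + b, with one row of W per output unit."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

# Toy 3-feature input classified into 2 classes.
x = relu([0.5, -1.2, 2.0])                # the negative activation is zeroed
W = [[0.1, 0.3, -0.2], [0.4, -0.1, 0.2]]  # placeholder weights
b = [0.0, 0.1]
probs = softmax(fc_layer(x, W, b))        # class probabilities summing to 1
```

The max-subtraction in `softmax` is the standard trick to avoid overflow in `exp` for large logits.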
Some applications of CNNs in traffic scene understanding include early implementations of recognizing traffic signs with 99% accuracy, though the recognition time was relatively long for real-time applications [6]. Modern adaptations have surpassed human performance in tasks like traffic sign recognition [7]. CNNs also enhance free-space detection through data fusion techniques [8] and improve recognition of traffic police gestures [9].
Shortly after the introduction of CNNs in the 1980s, [6] applied fractal texture segmentation for traffic sign detection using a receptive field neural network (NN). The network had an input layer of
In [7], the traditional CNN architecture was modified to incorporate multi-scale features, achieving an accuracy of 99.17% on the GTSRB dataset, surpassing human performance (
SNE-RoadSeg [8] integrates surface normal estimation (SNE) with a data-fusion CNN architecture for enhanced free-space detection, showcasing a unique dual-encoder system that merges RGB and surface normal information. This fusion, along with densely-connected skip connections in the decoder, enables precise segmentation. On the KITTI benchmark, it achieves an average precision (AP) of 94.07%.
A novel approach to traffic police gesture recognition is proposed in [9], combining a modified Convolutional Pose Machine (CPM) with a Long Short-Term Memory (LSTM) for temporal feature extraction. Enhanced by handcrafted features like Relative Bone Length and Angle with Gravity, it achieves 91.18% accuracy on the TPGR dataset.
In this section, we delve into the R-CNN family of models, which build upon the strengths of CNNs by introducing region-based detection for improved precision in complex scenarios such as traffic monitoring. We will explore the evolution of R-CNN models, beginning with Vanilla R-CNN and progressing through Fast R-CNN, Faster R-CNN, and Mask R-CNN, highlighting their advancements and contributions to traffic scene understanding.

FIGURE 1. Vanilla R-CNN workflow for object detection in a traffic scene: The process starts with identifying a set of proposed regions that could contain objects. Each proposed region is then passed through a pre-trained CNN to extract features, followed by classification using class-specific SVMs. Finally, bounding boxes are refined to enhance localization accuracy. This workflow demonstrates the ability to accurately detect and classify objects such as cars, poles, and trees, achieving precise object localization and high reliability in real-time traffic monitoring applications.
Vanilla R-CNN [10] extends traditional CNNs by using region proposals and pretrained CNNs for object detection. It generates region proposals to hypothesize object locations, processes each region through a CNN to extract feature vectors, and classifies these vectors with class-specific SVMs and bounding box regressors. This allows R-CNNs to manage object variability and achieve superior detection performance.
Figure 1 illustrates the Vanilla R-CNN process for object detection in a traffic scene. The first step is generating region proposals that may contain objects. If the image is denoted as \(I\), a category-independent algorithm such as selective search extracts a set of candidate regions \(\{r_1, \ldots, r_N\}\) (roughly 2,000 per image in the original work) that hypothesize object locations.
Each region \(r_i\) is then warped to a fixed input size and passed through the pre-trained CNN, which produces a feature vector \(\phi(r_i)\) describing its contents.
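Because proposals overlap heavily, R-CNN-style pipelines prune them with greedy non-maximum suppression (NMS) based on intersection-over-union (IoU). A minimal sketch, with boxes as (x1, y1, x2, y2) plus a confidence score and an illustrative 0.5 overlap threshold:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop overlapping rivals."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < thresh]
    return keep

# Two near-duplicate "car" proposals and one distant box.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # -> [0, 2]: the duplicate (index 1) is suppressed
```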
The extracted feature vectors are classified using class-specific linear SVMs. The score for region \(r_i\) under class \(c\) is \(s_c(r_i) = \mathbf{w}_c^{\top} \phi(r_i)\), where \(\mathbf{w}_c\) is the learned weight vector of the SVM for class \(c\); the region is assigned the highest-scoring class.
Finally, the bounding box for each region proposal is refined using a bounding box regressor. This regressor, trained per class, predicts translation and scale offsets that move the proposal toward the ground-truth box, improving localization accuracy.
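The standard R-CNN parameterization predicts center offsets scaled by the proposal's size and log-space scale factors for width and height. A sketch of applying predicted offsets \((t_x, t_y, t_w, t_h)\) to a proposal given as center plus size (the numbers are illustrative):

```python
import math

def refine_box(p, t):
    """Apply R-CNN-style regression offsets t = (tx, ty, tw, th)
    to a proposal p = (x, y, w, h) given as center + size."""
    px, py, pw, ph = p
    tx, ty, tw, th = t
    gx = pw * tx + px          # shift the center by a fraction of the size
    gy = ph * ty + py
    gw = pw * math.exp(tw)     # scale width/height in log-space
    gh = ph * math.exp(th)
    return (gx, gy, gw, gh)

# Zero offsets leave the proposal unchanged.
assert refine_box((100, 100, 40, 20), (0, 0, 0, 0)) == (100, 100, 40, 20)
```

Normalizing by the proposal size makes the targets scale-invariant, which is why the same regressor works for boxes of very different sizes.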
R-CNNs have greatly advanced object detection in traffic scenes, surpassing traditional CNNs with accuracy rates of up to 75.6% on the COCO dataset [11]. They excel in detecting pedestrians [12], vehicles [13], and traffic signs [14], [15]. Innovations like cascaded architectures [16], attention mechanisms [17], and hybrid approaches [18] further improve their performance. These advancements contribute to robust traffic scene understanding, aiding the development of automated driving systems [19].
R-CNN, introduced in [10], significantly improves object detection accuracy by combining region proposals with CNNs. It addresses occlusion by using region proposals to better localize objects, even when partially occluded, in contrast to sliding window methods like OverFeat.
In a comparative study, R-CNN outperformed traditional CNN on the COCO dataset,achieving
The authors of [12] focus on pedestrian detection, achieving a 23.3% miss rate on the Caltech dataset. They handle occlusion challenges by relying on the model's ability to learn from large datasets instead of explicit occlusion modeling, improving accuracy with more training data.
A method for traffic sign detection using sparse R-CNN [20] is introduced in [18]. On the BCTSDB and TT-
The Bagging R-CNN framework in [17] uses ensemble learning with adaptive sampling to improve object detection in complex traffic scenes. With a ResNet50 backbone, it achieved
Context R-CNN, as presented in [19], improves stationary surveillance by selecting and storing objects in memory banks by category. This approach enhanced recognition performance on the TJU-DHD-traffic and Pascal VOC datasets, increasing the mean average precision (mAP) by 0.37 compared to conventional methods.
Although Vanilla R-CNN is foundational, its two-stage process is slow and inefficient due to sequential processing of region proposals and classification. This results in high computational complexity and latency, making it unsuitable for real-time applications. Additionally, it struggles with small object detection and relies on selective search, leading to redundant regions, while requiring significant memory and complex training procedures, limiting its practical use.

FIGURE 2. Fast R-CNN procedure for object detection in a traffic scene: The model processes the input image by first extracting features for the entire image using a deep convolutional neural network (Deep ConvNet). RoI projections are mapped onto this shared feature map to generate fixed-size feature vectors using RoI pooling. The resulting RoI feature vectors are passed through fully connected layers to produce two outputs: class probabilities (using a softmax layer for classification) and bounding box regression to refine bounding box coordinates. This enables accurate object detection, such as identifying the “Police Car” class and refining the bounding box parameters in the traffic scene, making it suitable for real-time applications.
Fast R-CNN, introduced in [22], improves R-CNN by using a single forward pass of the image through a CNN to extract feature maps. It classifies object proposals and refines their spatial locations directly from shared feature maps, significantly improving training and testing speed. Fast R-CNN trains the VGG16 network [23] nine times faster and tests 213 times quicker than R-CNN, while achieving higher mAP on the PASCAL VOC 2012 dataset and surpassing SPPnet [24] in accuracy.
Figure 2 shows the Fast R-CNN approach for traffic scene object detection. Unlike R-CNN, Fast R-CNN uses a single deep CNN, denoted as \(f\), which processes the entire image once to produce a shared feature map \(F = f(I)\).
Region proposals are generated and mapped onto the shared feature map \(F\); an RoI pooling layer then extracts a fixed-size feature vector for each projected region.
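RoI pooling divides each projected region into a fixed grid and max-pools within every cell, so proposals of any size yield equal-length vectors. A minimal single-channel sketch (integer binning for brevity; the quantization this introduces is precisely the error that RoI Align later removes):

```python
def roi_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the region roi = (x1, y1, x2, y2) of a 2D feature map
    into a fixed out_h x out_w grid."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for gy in range(out_h):
        row = []
        for gx in range(out_w):
            ys = y1 + gy * h // out_h          # integer cell boundaries
            ye = y1 + (gy + 1) * h // out_h
            xs = x1 + gx * w // out_w
            xe = x1 + (gx + 1) * w // out_w
            cell = [feature_map[yy][xx]
                    for yy in range(ys, max(ye, ys + 1))
                    for xx in range(xs, max(xe, xs + 1))]
            row.append(max(cell))
        pooled.append(row)
    return pooled

fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
# A 4x4 RoI pooled to 2x2 keeps the max of each quadrant.
assert roi_pool(fmap, (0, 0, 4, 4)) == [[6, 8], [14, 16]]
```

Real implementations run this per channel and over many RoIs at once, but the fixed output size is the key property: it lets arbitrary proposals feed the same FC layers.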
Fast R-CNN uses a softmax layer for classification, unlike R-CNN's SVMs. The probability that region \(r\) belongs to class \(c\) is
\[ p(c \mid r) = \operatorname{softmax}_c\!\left(W \phi(r)\right), \]
where \(\phi(r)\) is the pooled feature vector of the region and \(W\) denotes the weights of the classification head.
The final output is a set of refined bounding boxes with class labels. Each bounding box is associated with the highest probability from the softmax layer.
Fast R-CNN has been successfully applied in various traffic scene tasks, enhancing detection and classification. It detects road surface signs [25], counts and identifies vehicles in challenging scenarios [26], and improves monitoring at intersections [27]. The technology also enables simultaneous detection of pedestrians and cyclists, excelling on urban datasets [28]. Additionally, it has been adapted for event-based vehicle classification and counting, demonstrating its versatility in dynamic traffic environments [29].
The authors of [28] propose a unified method for concurrent detection of pedestrians and cyclists using a novel UB-MPR detection proposal and a Fast R-CNN-based model. Tested on the Tsinghua-Daimler dataset, it achieves a recall rate of
In [30], a framework combining deformable part models (DPMs) with CNNs and region proposal networks (RPNs) accelerates Fast R-CNN. Tested on the KITTI car benchmark, it reports the overlap of true positives across the Easy, Moderate, and Hard settings.
In [27], a road user monitoring system for intersections is presented, combining a GMM-based DL approach with geometric warping. Integrated with Fast R-CNN, it achieves processing times on the MIT and Jinan datasets of 0.92s and
The study in [31] proposes a joint detection framework for pedestrians and cyclists using Fast R-CNN, incorporating techniques like difficult-case extraction, multi-layer feature fusion, and shared convolution layers. This deeper architecture outperformed its counterpart on pedestrian and cyclist detection, achieving respective improvements of 4.3% and
In [29], an event-based object detection system using Fast R-CNN with hyperparameter optimization on modified Stanford car and Myanmar cars datasets is introduced. It achieves accurate vehicle classification and counting in real-time event video streaming, with improved accuracy for weddings and precise learning rate assessments on the Myanmar Cars dataset.
Fast R-CNN is used in [32] (referred to as "AllLightR-CNN" in our work) to detect moving vehicles in various conditions, such as low light, long shadows, cloudy weather, and dense traffic. It achieves an average computation time of 0.59 seconds with high detection rates, including 98.44% recall,
Fast R-CNN improves speed over R-CNN by processing the entire image in a single pass but still relies on time-consuming region proposals, limiting real-time performance. While RoI pooling speeds up processing, it can introduce quantization errors, reducing accuracy, especially for small objects. Additionally, relying on external region proposal methods like Selective Search hinders real-time capabilities. Fast R-CNN also requires large labeled datasets and struggles with high-resolution images, where detailed feature extraction is critical.
Faster R-CNN, first introduced in [33], improves upon its predecessors, R-CNN and Fast R-CNN, by incorporating a Region Proposal Network (RPN). This RPN shares full-image convolutional features with the detection network, allowing for nearly cost-free region proposals.
Figure 3 illustrates the use of Faster R-CNN for object detection in a traffic scene. Faster R-CNN comprises two main modules: a deep fully convolutional network for proposing regions and a Fast R-CNN detector that classifies these regions. Together, these modules form a unified network for efficient object detection in complex environments like traffic scenes.
Given an input image, the first step in Faster R-CNN is to pass it through several convolutional and max pooling layers to produce a shared feature map. If the input image is denoted as \(I\), this can be written as
\[ F = f_{\text{conv}}(I), \]
where \(f_{\text{conv}}\) denotes the shared convolutional backbone and \(F\) the resulting feature map, reused by both the RPN and the detection head.
The RPN takes the shared feature map \(F\) and slides a small network over it, predicting at each spatial location a set of region proposals with objectness scores, \(\{(b_k, o_k)\}\), where \(b_k\) are box coordinates regressed relative to a fixed set of anchors and \(o_k\) is the estimated probability that anchor \(k\) contains an object.
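At each feature-map location, the RPN scores a small set of anchor boxes spanning several scales and aspect ratios (3 scales times 3 ratios, i.e. 9 anchors per location, in the original paper). A sketch of anchor generation for one location; the scales and ratios below are illustrative choices, not the paper's exact configuration:

```python
import math

def anchors_at(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Generate anchor boxes (x1, y1, x2, y2) centered at (cx, cy).
    Each anchor has area scale**2 and aspect ratio w/h = ratio."""
    boxes = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)     # keeps w * h = s * s while w / h = r
            h = s / math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

anchors = anchors_at(320, 240)       # 9 anchors for one feature-map cell
```

The RPN then regresses offsets from each anchor and keeps only the top-scoring, NMS-filtered proposals, which is what makes region proposal nearly cost-free compared to selective search.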
The proposed regions are reshaped using an RoI pooling layer to provide a fixed-size input to the FC layers. This process is the same in both Fast R-CNN and Faster R-CNN, with the key difference being the source of the regions: external algorithms in Fast R-CNN and an internal RPN in Faster R-CNN. The reshaping by RoI pooling can be mathematically expressed as:
\[ F_{\text{RoI}}(i, j) = \max_{(x, y) \in \operatorname{bin}(i, j)} F(x, y), \]
where \(\operatorname{bin}(i, j)\) is the sub-window of the projected region assigned to output cell \((i, j)\) of the fixed pooling grid.
Finally, the reshaped regions are fed into a sequence of FC layers that output the class probabilities and bounding box coordinates. If \(v_i\) denotes the pooled feature vector of region \(i\), the head outputs class probabilities \(p_i = \operatorname{softmax}(W_{\text{cls}} v_i)\) and per-class box offsets \(t_i = W_{\text{reg}} v_i\).
Faster R-CNN has advanced traffic scene understanding across several applications, improving environmental perception [34], optimizing traffic sign detection [35], and enhancing recognition of police gestures [36]. It has also refined pedestrian detection [37], boosted traffic surveillance [38], and enabled accurate vehicle categorization in traffic surveys [39]. These advancements improve performance in diverse conditions, aiding autonomous driving [40] and supporting real-time traffic analysis in smart cities [41].
An improved Faster R-CNN for small object detection is proposed in [42], specifically targeting small traffic signs in the TT100K dataset. It achieves a recall rate of
In [39], Faster R-CNN is used for vehicle detection in traffic surveys, chosen over SSD [43] and YOLO [44] for its higher accuracy despite slower speed. The authors highlight the advantages of DL over traditional methods for vehicle detection, queue length estimation, and vehicle type classification in untrained environments, achieving over

FIGURE 3. Faster R-CNN workflow for object detection in a traffic scene: The process starts by passing the image through several convolutional layers to generate a shared feature map. The feature maps are then processed by a region proposal network, which produces a set of region proposals with corresponding objectness scores. These proposed regions are reshaped using RoI pooling to ensure a consistent input size for the fully connected layers. Finally, the reshaped regions are classified into specific object categories, such as sign poles and police cars, and adjusted for accurate bounding-box localization, resulting in precise detection of various elements in the traffic scene.
In [45], Faster R-CNN is enhanced for object detection using hard negative sample mining and a two-channel feature network. By treating complex multi-classification tasks as binary classification, the modified approach achieves a 5% accuracy improvement on the KITTI dataset.
The authors of [40] propose an enhanced Faster R-CNN for traffic sign detection, incorporating feature pyramid fusion, deformable convolution, and ROI Align. Tested under sunny, sunset, and rainy conditions, it achieved mAP scores of
In [49], an enhanced Faster R-CNN with ResNet50-D, an attention-guided context feature pyramid network (ACFPN), and AutoAugment technology is proposed for traffic sign detection. Benchmarking against methods like SSD [43] and YOLOv3 [47], it achieved 29.8 FPS and 99.5% mAP on the CCTSDB dataset, surpassing other state-of-the-art methods, with competitive results on the TT100K dataset.
In [50], a correlation model analyzes haze's impact on traffic sign detection and sight distance, using a synthesized GTSDB dataset. The Faster R-CNN model, post-dehazing, achieved 95.11% detection accuracy. Results show that haze intensity inversely affects sight distance and detection, with accuracies of over 93% at 300 meters in light haze, 88%-93% at 100 meters in haze, and, at 50 meters in dense haze,
In [41], the authors discuss the Intelligent Transportation System (ITS)-oriented Information Acquisition Models (IAMs), using the Mirror Traffic dataset and Internet of Things (IoT) to predict traffic conditions and adjust signals in real-time. By comparing Faster R-CNN to R-CNN in a DL context, they found that Faster R-CNN, with an 85.10% recall and 86.79% accuracy, outperforms R-CNN by 6.20%.
Faster R-CNN enhances speed by integrating region proposal generation through an RPN and improves occlusion handling, enabling better detection of partially obscured or overlapping objects. Despite these advances, it still faces challenges with real-time processing due to the computational demands of the RPN, high memory usage, and the need for extensive training data. The model struggles with small, overlapping, or occluded objects, and its complexity makes implementation and tuning difficult, limiting its use in data-scarce scenarios.
Mask R-CNN [51] extends Faster R-CNN by adding a branch for object masks alongside class labels and bounding-box offsets. It achieves fine spatial layout extraction via pixel-to-pixel alignment, addressing limitations in Fast and Faster R-CNN. Retaining a two-stage approach, the first stage uses an RPN to propose regions, while the second stage predicts classes, bounding-box offsets, and binary masks for each RoI. RoI Align ensures precise feature alignment, improving segmentation accuracy. By predicting classification, regression, and segmentation in parallel, Mask R-CNN streamlines the multi-stage pipeline, offering a powerful solution for segmentation tasks.
Figure 4 shows Mask R-CNN applied to instance segmentation in a traffic scene. Key components unique to Mask R-CNN are highlighted, excluding those shared with Faster R-CNN (e.g., the backbone, RPN, and bounding-box regression). The focus is on its distinctive elements: the RoI Align operation and the mask prediction process.
After the RPN identifies potential object bounding box locations in the image, RoI Align warps features from the feature map to a fixed-size representation for each RoI without quantization:
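A minimal sketch of the quantization-free sampling behind RoI Align: each output bin is filled by bilinearly interpolating the feature map at a continuous coordinate rather than a rounded one. Real implementations average several samples per bin; this single-sample-per-bin version is a simplification for illustration.

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """Sample a (H, W) feature map at a continuous (y, x) location.
    Avoiding the rounding step is the core of RoI Align."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feature_map.shape[0] - 1)
    x1 = min(x0 + 1, feature_map.shape[1] - 1)
    dy, dx = y - y0, x - x0
    top = feature_map[y0, x0] * (1 - dx) + feature_map[y0, x1] * dx
    bot = feature_map[y1, x0] * (1 - dx) + feature_map[y1, x1] * dx
    return top * (1 - dy) + bot * dy

def roi_align(feature_map, roi, output_size):
    """Fill each output bin with one bilinear sample at its centre
    (simplified; the paper averages multiple samples per bin)."""
    x1, y1, x2, y2 = roi                     # continuous RoI coordinates
    h_out, w_out = output_size
    bin_h, bin_w = (y2 - y1) / h_out, (x2 - x1) / w_out
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            out[i, j] = bilinear_sample(feature_map,
                                        y1 + (i + 0.5) * bin_h,
                                        x1 + (j + 0.5) * bin_w)
    return out

fm = np.arange(16, dtype=float).reshape(4, 4)
aligned = roi_align(fm, roi=(0.5, 0.5, 2.5, 2.5), output_size=(2, 2))
```

Because no coordinate is snapped to the integer grid, the pooled features stay pixel-aligned with the RoI, which is what improves mask quality over RoI pooling.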

FIGURE 4. Mask R-CNN procedure for instance segmentation in a traffic scene: The input image is first processed through an RPN to identify regions of interest. These regions are then refined using the RoI Align operation, which ensures precise feature extraction by avoiding quantization effects, leading to more accurate segmentation. The refined features are passed through fully connected layers for class prediction and bounding box regression. Subsequently, the mask prediction process generates detailed binary segmentation masks for each instance using convolutional layers, producing accurate pixel-level masks. This approach provides high-resolution masks for various objects within the traffic scene, such as vehicles, pedestrians, and traffic signs, enabling precise instance segmentation.
where
The mask prediction process uses a Mask Head that generates binary masks for each RoI. In Equation 11,
Mask R-CNN has been applied to various tasks, including floodwater detection on roads [52], traffic sign detection and recognition [53], and train safety through improved obstacle identification [54]. It also supports urban traffic management via vehicle contour detection and tracking [55], and enables accurate vehicle counting to manage congestion [56]. Comparative studies confirm its superior performance in vehicle detection and classification compared to other models [57].
In [52], a Mask R-CNN-based method for floodwater detection achieves 99.2% classification accuracy and 93.0% segmentation precision on the IDRF dataset [58], outperforming a prior approach [59]. For traffic sign detection and recognition, [53] uses a two-phase method with Mask R-CNN for shape-based detection and Xception [60] for classification on 11,074 Taiwanese traffic signs, achieving, for triangular signs, a precision of
The ME Mask R-CNN method [54] improves automated train safety by integrating SSwin-Le Transformer, ME-PAPN, and multiscale enhancements, achieving a 91.3% mAP on the TrainObstacle dataset, 11.1% higher than Mask R-CNN, and an average detection rate of 4.2 FPS. It improves small-target detection by 19.35%, though gains for large and occluded targets are limited due to dataset characteristics.
A comprehensive comparison [57] of Faster R-CNN, Mask R-CNN, and ResNet-50 (R-CNN) on the 3,200-image RCNNs_Detection dataset (cars and jeeps from Kaggle) shows that Faster R-CNN and Mask R-CNN exceed 80% detection accuracy, while ResNet-50 achieves over 75%. This demonstrates their effectiveness in vehicle detection, classification, and counting.
Mask R-CNN extends Faster R-CNN with a mask prediction branch, enabling instance segmentation and facilitating the detection of occluded objects by distinguishing overlapping instances. However, this extra branch increases computational and memory demands, complicating real-time use and deployment on resource-constrained devices. The model also requires substantial labeled data, longer training times, and can struggle with small objects and complex scenes. Its increased complexity makes implementation, tuning, and debugging more challenging, particularly for custom applications.
YOLO is a fast and efficient real-time object detection system that predicts detections in a single pass, unlike R-CNN methods relying on region proposals. Introduced in [44], it treats detection as a regression problem, dividing the image into a grid where each cell predicts bounding boxes, confidence scores, and class probabilities. YOLO generalizes well across domains but struggles with precise localization, especially for small objects. Fast YOLO, the then-fastest general-purpose detector, is also introduced.

FIGURE 5. YOLO object detection in a traffic scene featuring a police car: The input image is initially divided into an SxS grid, with each cell predicting bounding boxes, confidence scores, and class probabilities. This process culminates in a final detection display that accurately identifies and localizes the police car and other objects within the scene. A comprehensive legend highlights key objects of interest, providing a clear and detailed overview of the detected items. This real-time object detection approach combines speed and accuracy, making it highly effective for dynamic traffic monitoring applications.
Figure 5 illustrates YOLO applied to a traffic scene. An input image
Bounding boxes are defined by(x,y,w,h,s),where(x,y) are the center coordinates relative to the grid cell,
Each grid cell predicts
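The grid encoding above can be made concrete with a small helper that maps an absolute ground-truth box to its responsible cell and the cell-relative targets; the function name is illustrative, and the normalization follows the YOLOv1 convention of cell-relative centres with image-relative width and height.

```python
def encode_yolo_target(box, img_w, img_h, S):
    """Map an absolute box (cx, cy, w, h) in pixels to its YOLO grid
    cell and the regression target (x, y, w, h): centre offset inside
    the cell in [0, 1), width/height normalized by image size."""
    cx, cy, w, h = box
    col = int(cx / img_w * S)          # which column of the SxS grid
    row = int(cy / img_h * S)          # which row of the SxS grid
    x = cx / img_w * S - col           # centre offset within the cell
    y = cy / img_h * S - row
    return row, col, (x, y, w / img_w, h / img_h)

# A 64x32 box centred at (208, 104) in a 416x416 image, with S = 13.
row, col, target = encode_yolo_target((208.0, 104.0, 64.0, 32.0), 416, 416, S=13)
```

Only the cell containing the object's centre is "responsible" for predicting it, which is why YOLO struggles when several small objects share one cell.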
YOLO applies Non-Maximum Suppression (NMS) to remove redundant overlapping boxes. Predictions are sorted by confidence score, and overlapping boxes (e.g., IoU > 0.5) with lower scores are removed.
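The NMS step described above amounts to the following greedy procedure, shown here as a plain-Python sketch with the example IoU threshold from the text.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping
    it above iou_thresh, repeat. Returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

# Two near-duplicate detections and one separate object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with IoU ≈ 0.68, so only the higher-scoring duplicate and the distant box survive.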
YOLO’s operation is summarized as
YOLO has been adapted for traffic flow counting [62], traffic light detection [63], and traffic sign recognition [64], [65]. It effectively detects pedestrians and vehicles [66] and operates under varied lighting and weather conditions [67], [68]. Continuous improvements enable better small-target detection and high-resolution video processing [69], [70]. Additionally, YOLO versions support license plate identification [71] and have been compared across models for traffic sign detection and vehicle classification in challenging environments [72], [73].
YOLO has seen numerous enhancements and iterations since its inception. The following presents the primary YOLO versions, accompanied by historical insights and a review of relevant scholarly literature.
Debuted in 2016, this original YOLO model was ground-breaking as it treated object detection as a singular regression problem, enabling it to predict bounding box coordinates and class probabilities from an image in a single pass [44].
The importance of Traffic Light Detection (TLD) for intelligent vehicles and Driving Assistance Systems is highlighted in [63], which applies YOLO to the daySequence1 from the LISA Traffic Light Dataset. The study achieves a 90.49% AUC, a 50.32% improvement over the previous best using Aggregated Channel Features (ACF), and a 58.3% AUC comparable to the ACF configuration. This underscores TLD's critical role in enhancing self-driving car functionality.
Launched in 2017, YOLOv2 [46] introduced major improvements, including detection of over 9000 object categories, the "Darknet-19" architecture, anchor boxes for better bounding box prediction, and multi-scale training. By combining detection-labeled data from COCO with classification data from ImageNet [74], the authors enabled joint classification and detection, creating YOLO9000 [46], capable of detecting a vast range of categories.
An optimized pedestrian and vehicle detection algorithm based on YOLOv2 is introduced in [66], which improves accuracy while maintaining efficiency. Comparative results on the KITTI dataset demonstrate its real-time capability, outperforming Faster R-CNN and YOLO V2 with 45% accuracy for pedestrians and, for vehicles,
Unveiled in 2018, YOLOv3 employed three distinct sizes of anchor boxes for predictions across three scales. It utilized a deeper architecture, "Darknet-53," and expanded its object category detection capabilities. Additionally, it adopted three different sizes of detection kernels
A YOLOv3-based traffic sign recognition system introduced in [64] achieves a detection mAP on the GTSDB dataset of
Released in 2020, YOLOv4 incorporated numerous improvements over its predecessors. It integrated features such as the "CSPDarknet53-PANet-SPP" architecture, PANet, and SAM block. Additionally, it employed the Complete IoU (CIoU) loss and the Mish activation function, aiming to enhance both speed and accuracy [75].
In [76], traffic sign detection and recognition for smart vehicles are explored using YOLOv4 and YOLOv4-tiny [75] integrated with Spatial Pyramid Pooling (SPP). Results show Yolo V4_1 (with SPP) achieving 99.4% accuracy and 99.32% mAP, while Yolov3 [47] SPP attains 98.99% mAP. These findings indicate that SPP enhances model performance.
To address environmental challenges like light intensity, extreme weather, and distance, TSR-YOLO [68], based on YOLOv4-tiny, incorporates Better-ECA (BECA), dense SPP networks, and k-means++ clustering for optimal prior boxes. On the CCTSDB2021 dataset, it achieves 96.62% accuracy, 79.73% recall, an 87.37% F1-score, and a 92.77% mAP, improving over YOLOv4-tiny while maintaining 81 FPS.
A novel semi-automatic method, combining a modified YOLOv4 and background subtraction, is introduced in [77] for unsupervised object detection in surveillance videos. It significantly increases mAP and outperforms state-of-the-art results on the CDnet 2014 and UA-DETRAC datasets, achieving, in nighttime street-corner scenes,
Introduced in 2020, YOLOv5 [61] was developed independently and is not an official continuation by the original YOLO creators, a naming choice that sparked controversy. It features a modified "CSPDarknet53" backbone with architectural optimizations that improve speed and real-world applicability.
To improve vehicle detection in traffic surveillance videos, [69] proposes an enhanced YOLOv5s model with a small target detection layer and Atrous SPP (ASPP) for multi-scale context, achieving 93.7% precision, 94.2% recall, and 93.9% mAP@0.5, improvements of 0.8%, 1.9%, and 2.3% over the original YOLOv5s, reducing missed and false detections.
Ghost-YOLO [70] is a lightweight model for traffic sign detection using the C3Ghost module to replace YOLOv5's feature extraction. It achieves
Introduced in September 2022, YOLOv6 boasts an efficient design comprising a backbone with RepVGG [78] or the newly introduced "CSPStackRep" blocks, a Path Aggregation Networks (PAN) topology neck, and a decoupled head with a hybrid-channel strategy. It uses advanced quantization techniques, such as post-training quantization and channel-wise distillation, leading to swifter and more precise detectors [79].
A license plate identification algorithm based on the YOLOv6 convolution model is outlined in [71], achieving a 94.7% precision rate for plate localization, with a proposed BLPNET (VGG-19 + ResNet-50) model achieving a 100% F1-score in character recognition, leading to reduced costs and improved traffic management effectiveness.
Released in July 2022, YOLOv7 set new object detection benchmarks, excelling in speed (5-160 FPS) and accuracy. Trained solely on the MS COCO dataset without pre-trained backbones, it introduced architectural modifications and "bag-of-freebies" to boost accuracy without sacrificing inference speed, though training time increased [80].
An enhanced YOLOv7-WCN network for traffic sign detection [81] improves accuracy from 85.5% to 89.0% by integrating Horblock modules with convolutional layers for efficient mapping, a normalization-based attention module (NAM), and replacing CIoU loss with Wasserstein distance loss [82].
Unveiled in January 2023, YOLOv8 [83] offers five scaled versions: YOLOv8n (nano), YOLOv8s (small), YOLOv8m (medium), YOLOv8l (large), and YOLOv8x (extra large), using a backbone similar to the one used in YOLOv5 but with some modifications on the cross-stage partial (CSP) layer. This iteration supports a wide range of computer vision tasks, including object detection, segmentation, pose estimation, tracking, and classification.
To address road accidents caused by human error, [72] proposes a traffic sign detection method using YOLOv5s6 [61] and YOLOv8s [83]. Testing on TT100k, TWTS, and a hybrid dataset shows YOLOv8s outperforms YOLOv5s6, achieving on the hybrid dataset an mAP@0.5 of
Recent studies have compared object detection algorithms across diverse environments and datasets. Reference [84] evaluated Faster R-CNN, YOLOv3, and YOLOv4 for aerial car detection on Stanford and PSU datasets, highlighting the impact of dataset characteristics and parameters like input size and learning rate on accuracy. On the PSU dataset, YOLOv3 and YOLOv4 achieved AP scores of 0.965, outperforming Faster R-CNN (0.739).
A broader evaluation of SSD, Faster R-CNN, and YOLO versions (YOLOv8, YOLOv7, YOLOv6, YOLOv5) on nine datasets (referred to as "ShokriCollection_DS" in our work) featuring varied road challenges was conducted in [73]. YOLO, particularly YOLOv7, excelled in vehicle classification with over 95% detection accuracy and
Another comparison [85], benchmarking state-of-the-art object detectors and large vision-model image annotators for road scene understanding in autonomous driving on the RSUD20K dataset, shows that YOLOv6 (73.5 mAP) and YOLOv8 (71.8 mAP) significantly outperform DETR (
YOLOv1 [44] and YOLOv2 [46] struggled with occlusions due to coarse feature extraction, limiting the detection of overlapping objects. YOLOv3 [47] improved this with multi-scale detection, and YOLOv4 [75] enhanced occlusion handling using feature pyramid networks. Despite these advances, challenges persist. CCW-YOLO [86] improved detection in dense scenes with a lightweight convolutional layer and C2f module, while HCLT-YOLO [87] used a hybrid CNN and transformer to reduce false alarms and missed detections.
YOLO is fast and efficient for real-time object detection due to its single-shot approach, processing an entire image in one pass. However, this design can struggle with detecting small objects, as the grid-based prediction may lack the precision needed for finer details. YOLO also has difficulty handling overlapping instances, where objects are close together, leading to potential inaccuracies. Additionally, its emphasis on speed may trade off some accuracy, particularly in complex scenes with multiple objects or intricate backgrounds, where precise localization and classification become more challenging.
ViT, introduced in [88], applies transformers to image classification by processing images as patch sequences. Using self-attention, it captures complex dependencies and prioritizes visible features and context, more effectively handling occlusions by inferring hidden objects. Unlike CNNs with localized receptive fields, ViT captures long-range dependencies across the entire image, marking a significant shift in image processing.
Figure 6 illustrates the application of ViT to classification in a traffic scene. The initial step in ViT is image patching, which involves splitting an image into a series of fixed-size patches. These patches are then flattened and linearly embedded. Additionally, position embeddings are added to retain positional information:
where
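The patching and embedding step can be sketched as follows. The random projection, class token, and position embeddings stand in for learned parameters, so only the shapes, not the values, are meaningful here.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches,
    returning an (N, patch*patch*C) array, as in ViT's first step."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    out = []
    for i in range(rows):
        for j in range(cols):
            p = image[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch, :]
            out.append(p.reshape(-1))
    return np.stack(out)

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32, 3))          # toy 32x32 RGB "image"
patches = patchify(img, patch=8)            # 16 patches of length 8*8*3 = 192

# Linear embedding plus position embeddings (random stand-ins for learned
# parameters), with a class token prepended: the Transformer input sequence.
d_model = 64
W_e = rng.normal(size=(patches.shape[1], d_model))   # patch projection
cls = rng.normal(size=(1, d_model))                  # class token
pos = rng.normal(size=(patches.shape[0] + 1, d_model))
tokens = np.vstack([cls, patches @ W_e]) + pos       # (N + 1, d_model)
```

The resulting sequence of N + 1 tokens is what the Transformer encoder layers consume; the extra class token accumulates a global representation for classification.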
The embedded patches then pass through a series of Transformer encoder layers. Each layer comprises two main parts: a multi-headed self-attention mechanism (MHSA) and a position-wise feed-forward network (FFN).
The MHSA is a self-attention mechanism that allows the model to weigh the importance of different patches when processing each patch:
where
The attention output is processed through multiple "heads," and the results from these heads are concatenated, represented by the symbol
where the
and
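A NumPy sketch of the multi-head computation: each head attends over its own slice of the query/key/value projections, the head outputs are concatenated, and the result is projected by the output matrix. Random matrices stand in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head scaled dot-product self-attention over a token
    sequence X of shape (N, d_model). Heads are computed independently
    on d_model/n_heads-wide slices, concatenated, and projected by Wo."""
    N, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        # Attention weights: each token scores every other token.
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, sl])
    return np.concatenate(heads, axis=1) @ Wo

rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(5, d))                       # 5 toy tokens
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
out = multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads=4)
```

Because every token attends to every other token, the receptive field is global from the first layer, which is the long-range-dependency property the text emphasizes.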
The position-wise FFN consists of two linear transformations with a ReLU activation in between:
where
After passing through the Transformer layers, the class token's output (from the final layer) is used to predict the class of the image via a simple linear layer:
where

FIGURE 6. Application of ViT in classifying a traffic scene with a crosswalk: The input image is divided into fixed-size patches, which are then flattened and linearly projected into an embedding space. Position embeddings are added to these patch embeddings to retain spatial relationships, along with a class embedding to represent the entire image. The combined embeddings are sequentially processed through the Transformer encoder, involving multiple layers of multi-head self-attention and feed-forward networks. The class token output from the final Transformer layer is passed through an MLP head to predict the class label, in this case, 'crosswalk'
Some applications of ViTs include detecting rain and road surface conditions [89], predicting pedestrian crossing intentions [90], identifying critical traffic moments [91], and detecting unusual traffic scenarios [92].
A cost-effective method for detecting rain and road conditions using ViTs and a Spatial Self-Attention network is presented in [89], achieving F1-scores of 91.13% for rain and 92.10% for road conditions. Adding a sequential detection module improved accuracy to 96.74% and 98.07%, respectively. The study's dataset, referred to as "ViT_DS" in our work, includes 10,000 freeway images from CCTV cameras in Orlando, Florida, labeled for 3 rain levels and 2 road condition levels.
Action-ViT [90] integrates multimodal data, including visual cues, poses, bounding boxes, and action annotations, and employs tailored data processing for each modality, enhancing pedestrian crossing intention prediction and achieving, on the JAAD dataset, a
ViT-TA [91] is a custom ViT that achieves 94% accuracy on the Dashcam Accident Dataset (DAD) in detecting critical moments at Time-To-Collision (TTC)
ViT-L [92] detects scenario novelty in traffic using infrastructure images and a triplet autoencoder trained on 70,000 traffic scene and graph pairs in Germany. Enhanced by expert domain knowledge and ViTs, it uses Angle-Based Outlier Detection (ABOD) in the latent space, achieving a 95.6% AUC. The dataset, referred to as "Wurst_DS" in our work and detailed in [93], comprises highway images for outlier model fitting.
ViTs excel at capturing long-range dependencies in images, allowing for a more holistic understanding of visual data. However, they require large amounts of data and substantial computational power to achieve high performance, making them less accessible in data-limited scenarios. ViTs can struggle with generalizing from smaller datasets, often leading to overfitting or suboptimal results. Additionally, they may be less efficient than CNNs for lower-resolution images, where the advantage of capturing long-range dependencies is diminished, and the computational overhead becomes more pronounced.
DETR, introduced in [94], is an innovative model for object detection that leverages the Transformer architecture to streamline the process into an end-to-end framework. By treating object detection as a direct set prediction problem, DETR eliminates the need for hand-designed components like non-maximum suppression and anchor generation. The model employs a combination of a CNN for feature extraction and a Transformer for decoding these features into bounding box predictions and class labels in a single forward pass. Leveraging self-attention, DETR models complex relationships between objects and their context, making it particularly effective at detecting and localizing partially occluded objects by interpreting visible fragments within the overall scene. This unified transformer-based approach marks a significant advancement in handling occlusions and simplifying object detection.

FIGURE 7. DETR object detection in a traffic scene: The process begins with a CNN extracting image features, which are then enhanced with positional encodings to preserve spatial information and processed through a transformer encoder. The encoder employs several layers of self-attention and FFNs to refine these features for improved detection accuracy. The transformer decoder uses a fixed set of learned object queries, combined with the encoded features, to generate predictions for possible objects, including their classes and bounding boxes. Four FFNs are employed to finalize classifications and bounding box coordinates, effectively highlighting detected objects in the scene.
Figure 7 depicts the application of DETR to object detection in a traffic scene. DETR starts with feature extraction: given an input image \( x \in \mathbb{R}^{3 \times H_0 \times W_0} \), a CNN backbone produces a lower-resolution feature map \( f = \mathrm{CNN}(x) \in \mathbb{R}^{C \times H \times W} \).
Positional encodings (PEs) are then added to the feature map, preserving the spatial information that the permutation-invariant Transformer would otherwise discard: \( z_0 = f + \mathrm{PE} \).
At the next step, the Transformer encoder processes this enhanced feature map through several layers of self-attention and feed-forward networks (FFNs), refining the features for detection.
The Transformer decoder uses a set of \( N \) fixed learned object queries \( q = \{ q_1, \ldots, q_N \} \), attending to the encoded features and producing one output embedding per query, where \( N \) is chosen to be larger than the typical number of objects in an image.
The outputs of the Transformer decoder are then processed to yield a fixed-size set of predictions, irrespective of the number of objects in the image. This is represented as:
\[ \hat{y} = \left\{ \left( \hat{c}_i, \hat{b}_i \right) \right\}_{i=1}^{N} \]
where \( \hat{c}_i \) is the predicted class label (including a special "no object" class \( \varnothing \)) and \( \hat{b}_i \) is the predicted bounding box for the \( i \)-th query.
The loss function is a crucial part of training DETR, incorporating a unique bipartite matching loss to match predicted and ground truth objects, along with classification and bounding box regression losses.
Bipartite matching is used to find the optimal permutation \( \hat{\sigma} \) of predicted objects that best aligns them with the ground-truth set \( y = \{ y_1, \ldots, y_N \} \) (padded with \( \varnothing \) to size \( N \)):
\[ \hat{\sigma} = \arg\min_{\sigma} \sum_{i=1}^{N} \mathcal{L}_{\text{match}}\left( y_i, \hat{y}_{\sigma(i)} \right) \]
where \( \mathcal{L}_{\text{match}} \) is a pairwise cost combining class probability and bounding box similarity; the optimal assignment is computed efficiently with the Hungarian algorithm.
The loss function is then evaluated on the matched pairs:
\[ \mathcal{L}_{\text{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}\left( c_i \right) + \mathbb{1}_{\{ c_i \neq \varnothing \}} \mathcal{L}_{\text{box}}\left( b_i, \hat{b}_{\hat{\sigma}(i)} \right) \right] \tag{23} \]
where \( \hat{p}_{\hat{\sigma}(i)}(c_i) \) is the predicted probability of the true class \( c_i \), and \( \mathcal{L}_{\text{box}} \) combines an \( \ell_1 \) loss with a generalized IoU loss on the bounding boxes.
During training, DETR minimizes this loss function to learn the parameters that result in the best predictions of object classes and bounding boxes, tailored to match the true objects in the image as closely as possible. This streamlined approach of direct set prediction and loss minimization via bipartite matching distinctly sets DETR apart in the field of object detection.
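To make the matching step concrete, the bipartite assignment between predictions and ground truth can be sketched with SciPy's Hungarian solver. This is a minimal illustration, not DETR's exact cost: the real matching cost also includes a generalized IoU term, and the class probabilities, boxes, and `box_weight` below are invented for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes, box_weight=5.0):
    """Match ground-truth objects to predictions by minimizing a combined
    classification + L1 box cost, solved with the Hungarian algorithm."""
    # Classification cost: negative predicted probability of each true class.
    cls_cost = -pred_probs[:, gt_labels]                 # (num_preds, num_gt)
    # Box cost: pairwise L1 distance between predicted and ground-truth boxes.
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    pred_idx, gt_idx = linear_sum_assignment(cls_cost + box_weight * box_cost)
    return list(zip(pred_idx, gt_idx))

# Toy scene: 3 object queries, 2 ground-truth objects, 4 classes.
probs = np.array([[0.70, 0.10, 0.10, 0.10],   # query 0 favors class 0
                  [0.10, 0.80, 0.05, 0.05],   # query 1 favors class 1
                  [0.25, 0.25, 0.25, 0.25]])  # query 2 is uncertain
boxes = np.array([[0.2, 0.2, 0.1, 0.1],
                  [0.6, 0.6, 0.2, 0.2],
                  [0.9, 0.9, 0.1, 0.1]])
gt_labels = np.array([1, 0])
gt_boxes = np.array([[0.6, 0.6, 0.2, 0.2],
                     [0.2, 0.2, 0.1, 0.1]])
matches = hungarian_match(probs, boxes, gt_labels, gt_boxes)
```

Here query 1 is matched to the class-1 object and query 0 to the class-0 object, while the leftover query would be supervised toward the "no object" class \( \varnothing \).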
DETR's applications are diverse, including enhancements for detecting traffic signs of various sizes [95], recognizing small or weather-affected signs [96], and accelerating model training [97]. Additionally, DETR enhances object detection for autonomous driving by effectively aligning objects with their respective scenes [98].
An innovative approach to traffic sign detection, DSRA-DETR [95], emphasizes enhanced multiscale detection performance through modules that aggregate features across scales, effectively reducing feature noise, preserving low-level features, and boosting the model's ability to recognize objects at various sizes. This results in significant improvements in detection accuracy with impressive APs of 76.13% and 78.24% on GTSDB and CCTSDB datasets, respectively.
MTSDet [96] enhances traffic sign detection by using an Attention Mechanism Network (AMNet) and a Path Aggregation Feature Pyramid Network (PAFPN) for multi-scale feature fusion. It excels at detecting small or weather-affected signs, achieving mAP scores of 92.9% on GTSRB and 94.3% on CTSD.
In [97], a Spatially Modulated Co-Attention (SMCA) mechanism improves DETR by focusing co-attention near initial box estimates and integrating multi-head, scale-selection attention. This yields an AP of 45.6% on COCO 2017 within 108 training epochs (SMCA-R50, Table 2).
DetectFormer [98] improves autonomous driving object detection by incorporating a ClassDecoder and a Global Extract Encoder (GEE) to enhance category sensitivity and scene alignment. With data augmentation and attention mechanisms, it achieves AP50 and AP75 scores of 97.6% and 91.4%, respectively, on the BCTSDB dataset.
DETR simplifies object detection by eliminating the need for region proposals and streamlining the process with a transformer-based architecture. However, it requires extensive training data to perform well and is computationally intensive, making it challenging to deploy in resource-constrained environments. DETR can also be slower to converge during training, requiring more epochs to reach optimal performance. Additionally, it struggles with detecting small objects in cluttered scenes, where the lack of region proposals can lead to less precise localization and classification.
GNNs are a powerful tool for traffic scene understanding, representing road networks as graphs and capturing spatial-temporal relationships. They enable precise analysis of vehicle trajectories, pedestrian movements, and interactions, aiding tasks like congestion prediction, collision avoidance, and adaptive signal control. By leveraging graph-based methods, GNNs enhance real-time decision-making in intelligent transportation systems, contributing to safer and more efficient urban mobility.
GCN, first introduced in [99], is a neural network designed for graph-structured data, extending the concept of convolution from grid-like data (e.g., images) to graphs. A graph \( G = (V, E) \) with \( N \) nodes is described by an adjacency matrix \( A \in \mathbb{R}^{N \times N} \) and a node feature matrix \( X \in \mathbb{R}^{N \times F} \).
The core idea of a GCN is to perform a convolution-like operation on a graph. The graph convolution for a single layer is expressed as:
\[ H^{(l+1)} = \sigma\!\left( \hat{A}\, H^{(l)} W^{(l)} \right) \]
where \( H^{(l)} \) is the matrix of node representations at layer \( l \) (with \( H^{(0)} = X \)), \( W^{(l)} \) is a learnable weight matrix, and \( \sigma(\cdot) \) is a nonlinear activation such as ReLU.
The normalized adjacency matrix \( \hat{A} \) is computed as:
\[ \hat{A} = \tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2}, \qquad \tilde{A} = A + I \]
where \( I \) is the identity matrix (adding self-loops) and \( \tilde{D} \) is the diagonal degree matrix of \( \tilde{A} \), with \( \tilde{D}_{ii} = \sum_j \tilde{A}_{ij} \).
A typical GCN model has multiple layers. For instance, a two-layer GCN is:
\[ Z = \operatorname{softmax}\!\left( \hat{A}\, \mathrm{ReLU}\!\left( \hat{A} X W^{(0)} \right) W^{(1)} \right) \]
where \( Z \) contains the predicted class probabilities for each node.
The GCN is trained by minimizing the cross-entropy loss:
\[ \mathcal{L} = - \sum_{l \in \mathcal{Y}_L} \sum_{f} Y_{lf} \ln Z_{lf} \]
where \( \mathcal{Y}_L \) is the set of labeled nodes and \( Y \) is the ground-truth label matrix.
By iteratively updating \( W^{(0)} \) and \( W^{(1)} \) via gradient descent, the GCN learns node representations that reflect both node features and graph structure.
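A single propagation step of this formulation can be sketched in a few lines of NumPy. The three-node road graph, its features, and the identity weight matrix below are invented purely for illustration.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(A_hat @ H @ W), where
    A_hat = D^{-1/2} (A + I) D^{-1/2} adds self-loops and normalizes by degree."""
    A_tilde = A + np.eye(A.shape[0])          # self-loops
    d = A_tilde.sum(axis=1)                   # degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt
    return np.maximum(A_hat @ H @ W, 0.0)     # ReLU activation

# Toy road graph: 3 intersections in a path; features = [traffic volume, lanes].
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.array([[10.0, 2.0],
              [20.0, 4.0],
              [5.0, 1.0]])
W = np.eye(2)  # identity weights, so the effect of normalization is visible
H1 = gcn_layer(A, H, W)
```

Each output row mixes a node's own features with its neighbors', weighted by the symmetric degree normalization; stacking two such layers and applying a row-wise softmax gives the two-layer classifier above.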
GCNs excel in traffic scene understanding by modeling complex relationships in graph-structured data. Applications include vehicle behavior classification across datasets [100], recognizing dynamic traffic police gestures [101], interpreting these gestures in real-time [102], understanding police intentions from visual cues [103], and recognizing actions of traffic participants in advanced driver-assistance systems [104].
The MR-GCN architecture for vehicle behavior classification [100] achieves sensor invariance and high accuracy: 99% on Apollo, 89% on KITTI, and 84% on Indian datasets. Combining spatial scene graphs and LSTM layers, it encodes spatial-temporal dynamics and outperforms baselines, demonstrating robustness across diverse datasets, even with fewer landmarks.
In [101], a gesture recognition method focuses on dynamic traffic police gestures using a spatial-temporal GCN (ST-GCN) with attention mechanisms and adaptive graph structures. It achieves 87.72% accuracy on the Chinese Traffic Police Gestures (CTPG) dataset, outperforming existing action-recognition methods.
Pose GCN [102] presents an online activity recognition method employing pose estimation and GCNs to interpret traffic police gestures in real-time frames, achieving fast per-frame response times on the TPGR dataset.
In [103], a system for recognizing traffic police intentions from visual cues achieves strong recognition performance on the TPGR dataset.

FIGURE 8. GAT-based license plate detection: An image of a car’s rear undergoes convolution to extract essential features, which are then refined by a GAT layer using an attention mechanism to determine the importance of neighboring features. The GAT operates on a graph representation, where each node is associated with a feature vector, and computes attention weights between nodes to aggregate information effectively. The saliency map is produced by fusing these attention-weighted features, guiding an RPN to accurately localize and identify the license plate. This sophisticated setup, combined with attention mechanisms to compute dynamic weights, enhances detection precision, ensuring reliable and accurate identification of license plates under various conditions. The integration of multiple attention heads helps capture different aspects of neighboring relationships, contributing to robustness in feature refinement.
The framework in [104] employs 3D human pose estimation and a dynamic adaptive GCN to recognize actions of traffic police, cyclists, and pedestrians. By optimizing object detection and pose estimation modules, it processes multiple objects simultaneously in real traffic scenarios, achieving real-time recognition on the 3D-HPT dataset.
GCNs effectively model complex relationships in non-Euclidean data like graphs but face challenges with scalability due to high computational and memory demands on large graphs. They are prone to over-smoothing, where node features lose distinction after multiple layers, and require careful design to capture long-range dependencies, as standard architectures may not naturally handle distant node relationships.
GAT, introduced in [106], is a neural network architecture for graph-structured data that incorporates attention mechanisms. Using masked self-attentional layers, GATs overcome the limitations of graph convolutions by allowing nodes to assign varying weights to neighbors' features. This avoids computationally expensive matrix operations, such as inversion, and does not require a priori knowledge of the graph structure.
Figure 8 illustrates the GAT-based license plate detection process, where the attention mechanism refines features for accurate license plate identification in traffic scenes. For a graph with node features \( \{ h_1, \ldots, h_N \} \), \( h_i \in \mathbb{R}^{F} \), a GAT layer produces updated features \( \{ h'_1, \ldots, h'_N \} \), \( h'_i \in \mathbb{R}^{F'} \).
The key idea behind GAT is to compute an attention weight for each neighbor of a node and use these weights to aggregate information from the neighbors. GAT employs a self-attention mechanism to calculate the raw attention coefficient for each neighboring node \( j \) of node \( i \):
\[ e_{ij} = \mathrm{LeakyReLU}\!\left( \mathbf{a}^{\top} \left[ W h_i \,\|\, W h_j \right] \right) \]
where \( W \in \mathbb{R}^{F' \times F} \) is a shared linear transformation, \( \mathbf{a} \in \mathbb{R}^{2F'} \) is a learnable attention vector, and \( \| \) denotes concatenation. The coefficients are normalized across each node's neighborhood \( \mathcal{N}_i \) with a softmax:
\[ \alpha_{ij} = \frac{\exp\left( e_{ij} \right)}{\sum_{k \in \mathcal{N}_i} \exp\left( e_{ik} \right)} \]
where \( \alpha_{ij} \) represents the importance of node \( j \)'s features to node \( i \). The updated representation of node \( i \) is then the attention-weighted aggregation:
\[ h'_i = \sigma\!\left( \sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j \right) \]
where \( \sigma(\cdot) \) is a nonlinearity.
To capture diverse neighborhood relationships, GAT uses multiple attention heads. Outputs from all \( K \) heads are concatenated and passed through a learnable weight matrix:
\[ h'_i = \big\Vert_{k=1}^{K} \sigma\!\left( \sum_{j \in \mathcal{N}_i} \alpha^{(k)}_{ij} W^{(k)} h_j \right) \]
where \( \alpha^{(k)}_{ij} \) and \( W^{(k)} \) are the attention coefficients and weight matrix of the \( k \)-th head.
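The attention computation above can be sketched for a single head in NumPy. The toy graph, identity weights, and attention vector below are illustrative; a real layer would learn \( W \) and \( \mathbf{a} \) by backpropagation.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_layer(A, H, W, a):
    """Single-head GAT layer: e_ij = LeakyReLU(a^T [Wh_i || Wh_j]),
    softmax-normalized over each node's neighborhood (self-loop included)."""
    N = A.shape[0]
    Z = H @ W                      # shared linear transform of node features
    A_self = A + np.eye(N)         # every node also attends to itself
    out = np.zeros((N, Z.shape[1]))
    for i in range(N):
        nbrs = np.nonzero(A_self[i])[0]
        # raw attention scores over node i's neighborhood
        e = np.array([leaky_relu(a @ np.concatenate([Z[i], Z[j]])) for j in nbrs])
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()       # attention weights sum to 1
        out[i] = (alpha[:, None] * Z[nbrs]).sum(axis=0)
    return out

# Toy example: node 0 is connected to nodes 1 and 2; the attention vector
# scores each neighbor by its first feature, so node 2 (largest) dominates.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [4.0, 0.0]])
W = np.eye(2)
a = np.array([0.0, 0.0, 1.0, 0.0])  # attends to the neighbor's first feature
out = gat_layer(A, H, W, a)
```

Node 0's output is a convex combination of its neighborhood's features, pulled toward node 2 because it receives the largest attention weight.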
GATs allocate attention within scene graphs, enabling precise object detection and tracking. They are applied to detect license plate numbers in urban environments [107], track pedestrians for traffic safety [108], enhance semantic segmentation, anomaly detection, traffic flow analysis, and improve classification accuracy in complex traffic scenes [109].

FIGURE 9. GIN procedure for traffic scene matching: The process begins with “Dataset Preprocessing,” where traffic datasets are converted into road scene graphs via a graph prediction module. This step involves cleaning, filtering, and transforming raw traffic data into a structured format suitable for graph construction. Concurrently, "Query Preprocessing" processes a traffic scene query through "Actor" and "Map Components" clusters, forming a scene graph of the specific road described in the input query. This involves identifying and classifying key elements of the traffic scene, such as vehicles, pedestrians, and road features. These preprocessed graphs are then used to construct the “Input Graph” (Graphx) and the “Isomorphic Matching Subgraph” (Graphy).
APSEGAT [107] efficiently detects license plate numbers in crowded urban environments with diverse vehicles and complex scenes, achieving a superior F-Score of 90% on the AMLPR dataset (Table 2).
The GAM tracker [108] employs sparse candidate selection, graph attention maps, and distance matching loss for pedestrian tracking, achieving 94.99% MOTA on the Pets-mf dataset. It addresses pedestrian safety challenges and supports traffic statistics and abnormal behavior analysis in intelligent transportation systems.
SCENE [109] leverages heterogeneous GNNs and graph convolutions to encode diverse traffic scenarios, achieving 91.17% accuracy in binary node classification tasks on a custom large-scale dataset ("GAT_SCENE") with 22,400 sequences, each containing 3 seconds of temporal history. Performance and transferability are notably enhanced by incorporating edge features into the GAT operator.
GATs enhance GCNs by assigning varying importance to neighboring nodes via an attention mechanism for more nuanced relationship modeling. However, this mechanism is computationally expensive on large graphs, prone to overfitting when focusing on a few nodes, and struggles to capture long-range dependencies due to its local application.
GIN, introduced in [110], is a GNN designed to process graph-structured data by effectively capturing intricate graph topology. It enhances node representations by considering both a node's features and its neighbors' contributions, enabling precise structural differentiation and identifying subtle graph differences.
Figure 9 illustrates the GIN procedure for matching traffic scenes. Let \( G = (V, E) \) be a graph with initial node features \( h_v^{(0)} \) for each \( v \in V \). At iteration \( k \), GIN updates each node representation as:
\[ h_v^{(k)} = \mathrm{MLP}^{(k)}\!\left( \left( 1 + \epsilon^{(k)} \right) h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)} \right) \]
where \( \mathrm{MLP}^{(k)} \) is a multi-layer perceptron and \( \mathcal{N}(v) \) is the neighborhood of node \( v \).
In some cases, a readout function aggregates node-level information to obtain graph-level embeddings. For example, the graph-level embedding \( h_G \) can be obtained by concatenating readouts from all iterations:
\[ h_G = \operatorname{CONCAT}\!\left( \operatorname{READOUT}\!\left( \left\{ h_v^{(k)} \mid v \in V \right\} \right) \,\Big|\, k = 0, 1, \ldots, K \right) \]
where \( \operatorname{READOUT} \) is an injective aggregation such as the sum.
GIN thus iteratively updates node representations using their own features and those of neighboring nodes. A learnable parameter \( \epsilon^{(k)} \) weights a node's own features against the sum of its neighbors', allowing GIN to distinguish graph structures with discriminative power comparable to the Weisfeiler-Lehman isomorphism test.
GINs are widely applied in traffic scene understanding, including road scene-graph embedding [111], vehicle and pedestrian path prediction [112], traffic scene retrieval [113], automatic scenario detection [114], and real-time pedestrian path prediction [115].
Roadscene2vec [111] uses GCNs, GINs, and CNNs to enhance road scene-graph analysis for spatial modeling, graph learning, and risk assessment. For collision prediction, it reports strong results on the 271-syn dataset.
Pishgu [112] introduces a lightweight network combining GINs with attention mechanisms for path prediction, improving ADE/FDE for vehicles (bird's-eye view) and pedestrians (high-angle view) on the ActEV/VIRAT dataset.
RSG-Search [113] is a graph-based traffic scene retrieval system using sub-graph isomorphic searching for actor configurations and semantic relationships. It ensures dataset compatibility (e.g., nuScenes, NEDO), achieving full accuracy with low matching times (0-2 seconds). The RSG dataset includes 500 traffic scenes, 200,000 topological graphs, 6 node types (e.g., vehicle, pedestrian), and 25 relationship categories (e.g., 'passing-by', 'waiting-for').
The study in [114] presents expert-knowledge-aided representation learning for traffic scenarios using GIN and an automatic mining strategy. It enables effective clustering and novel scenario detection without manual labeling, achieving a high AUC on a simulated dataset based on OpenStreetMap.
CARPe [115] introduces a real-time pedestrian path prediction approach by combining GINs with an agile convolutional NN design. It achieves impressive results with an ADE of 0.80 and FDE of 1.48 on the ETH dataset, significantly improving speed and accuracy for applications such as autonomous vehicles and environmental monitoring.
GINs excel at distinguishing graph structures by capturing subtle differences between nodes and edges, making them effective for graph classification. However, they are prone to overfitting with limited data, requiring careful hyperparameter tuning. GINs also face scalability and efficiency challenges on large or complex graphs due to their depth and computational demands.
CapsNet, introduced in [116], addresses CNN limitations by effectively handling spatial hierarchies between simple and complex objects. It encapsulates feature information (e.g., pose, texture, deformation) into neuron groups called "capsules," which use dynamic routing for enhanced feature representation and recognition. A capsule's activity vector represents the instantiation parameters of an entity (e.g., object or part), with its length indicating the probability of the entity's presence in the input.
To ensure the output vector of a capsule is a small-length vector if the probability of the entity being present is low and a long-length vector if it is high, a squashing function is used. It is typically defined as:
\[ v_j = \frac{\left\lVert s_j \right\rVert^2}{1 + \left\lVert s_j \right\rVert^2} \, \frac{s_j}{\left\lVert s_j \right\rVert} \tag{34} \]
where \( s_j \) is the total input to capsule \( j \) and \( v_j \) is its output vector, whose length lies in \( [0, 1) \).
Dynamic routing allows a capsule to send its output to parent capsules in the next layer based on prediction agreement. Each lower-layer capsule \( i \) predicts the output of higher-layer capsule \( j \) using a transformation matrix \( W_{ij} \):
\[ \widehat{u}_{j|i} = W_{ij} u_i \]
where \( u_i \) is the output of capsule \( i \).
Capsules send outputs to parent capsules based on "routing by agreement," measured by the scalar product between the prediction vector and the parent's output vector. The coupling coefficients \( c_{ij} \) are obtained from routing logits \( b_{ij} \) via a softmax:
\[ c_{ij} = \frac{\exp\left( b_{ij} \right)}{\sum_{k} \exp\left( b_{ik} \right)} \tag{35} \]
The total input \( s_j \) to capsule \( j \) is the coupling-weighted sum of all predictions from the layer below:
\[ s_j = \sum_{i} c_{ij} \widehat{u}_{j|i} \]
Algorithm 1 shows the dynamic routing procedure to train the coupling coefficients \( c_{ij} \) between each primary capsule \( i \) and output capsule \( j \).

FIGURE 10. Traffic scene classification with CapsNet: The input image is first processed through a ReLU convolution layer, followed by 32 primary capsules (’PrimaryCaps’). Capsules, fundamental units of CapsNets, encapsulate feature information like pose and texture, with the length of each capsule’s output vector representing the entity’s presence probability. These primary capsules are linked to 8 traffic capsules (‘TrafficCaps’) via weight matrices \( W_{ij} \), which produce the prediction vectors used in dynamic routing.
Algorithm 1 Routing Algorithm to Train the Coupling Coefficients Between Each Primary and Output Capsule
procedure ROUTING\( \left( \widehat{u}_{j|i}, r, l \right) \)
for all capsule \( i \) in layer \( l \) and capsule \( j \) in layer \( \left( l + 1 \right) \): \( b_{ij} \leftarrow 0 \)
for \( r \) iterations do
for all capsule \( i \) in layer \( l \): \( c_i \leftarrow \operatorname{softmax}\left( b_i \right) \) {softmax computes Equation 35}
for all capsule \( j \) in layer \( \left( l + 1 \right) \): \( s_j \leftarrow \mathop{\sum }\limits_{i} c_{ij} \, \widehat{u}_{j|i} \)
for all capsule \( j \) in layer \( \left( l + 1 \right) \): \( v_j \leftarrow \operatorname{squash}\left( s_j \right) \) {squash computes Equation 34}
for all capsule \( i \) in layer \( l \) and capsule \( j \) in layer \( \left( l + 1 \right) \): \( b_{ij} \leftarrow b_{ij} + \widehat{u}_{j|i} \cdot v_j \)
return \( v_j \)
These equations and the architecture help preserve detailed spatial information and enable the network to better understand the relationships and hierarchies between different parts of the objects.
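The squashing function and routing-by-agreement loop can be sketched directly in NumPy. The capsule counts and random prediction vectors below are illustrative; in a trained network the \( \widehat{u}_{j|i} \) come from learned transformation matrices \( W_{ij} \).

```python
import numpy as np

def squash(s, eps=1e-9):
    """Squashing nonlinearity (Eq. 34): short vectors shrink toward length 0,
    long vectors toward length 1, preserving direction."""
    norm2 = (s ** 2).sum(axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def dynamic_routing(u_hat, r=3):
    """Routing by agreement (Algorithm 1). u_hat[i, j] holds the prediction
    vector from lower capsule i for output capsule j."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                           # b_ij <- 0
    for _ in range(r):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # c_i <- softmax(b_i)
        s = (c[..., None] * u_hat).sum(axis=0)                # s_j <- sum_i c_ij u_hat
        v = squash(s)                                         # v_j <- squash(s_j)
        b = b + (u_hat * v[None]).sum(axis=-1)                # b_ij += u_hat . v_j
    return v

# Toy example: 4 primary capsules route to 2 output capsules of dimension 8.
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(4, 2, 8))
v = dynamic_routing(u_hat, r=3)
```

The length of each output vector `v[j]` stays below 1 by construction and can be read as the probability that output capsule \( j \)'s entity is present.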
Figure 10 illustrates the CapsNet procedure for traffic scene classification. The third layer, Traffic Capsules (TrafficCaps), contains 8 capsules, each a 16-dimensional vector, fully connected to the previous layer's capsules. Dynamic routing ensures layer communication, while a squashing function bounds output vector lengths between 0 and 1, indicating entity presence probabilities. Weight matrices \( W_{ij} \) transform each primary capsule's output into the prediction vectors used by the traffic capsules.
CapsNets outperform traditional CNNs in robustness and generalization, excelling at capturing spatial relationships and complex interactions in traffic scenes, such as congested intersections. Applications include traffic sign detection [117], highway scene segmentation [118], and complex scenario recognition [119], offering deeper and more reliable insights into dynamic road environments.
TSDCaps [117] addresses CNN limitations for traffic sign detection, achieving 97.62% accuracy on the GTSRB dataset (Table 2).
A scene segmentation model in [118], trained on Auckland Highway Images (AHI), achieves 74.61% accuracy. It enhances scene comprehension using matrix representations for pose and spatial relationships, reduces manual data manipulation, and addresses the challenging Picasso problem.
The authors of [119] proposed "ImprovedCaps," a two-step approach for complex scenes. It enhances traffic sign features through image processing before applying CapsNet for recognition, improving GTSRB accuracy by 2%- 5% in complex scenarios and achieving 96% overall accuracy.
In [120], "LiuCaps" is introduced for traffic-light sign recognition in autonomous vehicles. Trained on the TL_Dataset, it achieves 98.72% accuracy and a 99.27% F1-score, outperforming traditional CNNs while reducing training data needs and improving spatial relationship handling.
CapsNets excel at capturing spatial hierarchies and pose relationships, providing detailed object structure understanding compared to CNNs. However, they are computationally intensive, memory-demanding, and challenging to train, requiring complex optimization. Their limited scalability and efficiency on large datasets hinder adoption in resource-critical, large-scale applications.
Hyperparameter optimization (HPO) tunes model architecture and training parameters such as the learning rate, network structure, and regularization, enhancing accuracy, generalization, and efficiency across datasets. Fine-tuning learning rates, anchor box dimensions (R-CNN, YOLO), and architectures (DETR, ViT) improves performance and pixel-level accuracy. In graph learning, tuning GCN layers and node features enhances spatial and structural relationship capture, improving predictions. Techniques like dropout, early stopping, and batch size optimization combat overfitting and computational constraints, while task-specific tuning strengthens multitask learning and temporal modeling for traffic scene understanding.
The learning rate controls weight updates: higher rates accelerate training but risk instability; lower rates improve precision but slow convergence. Batch size affects stability and memory: larger sizes enhance stability; smaller ones improve generalization. Epochs determine dataset passes: more epochs improve learning but risk overfitting, while fewer reduce training time but may underfit. Momentum accelerates training by smoothing updates, reducing oscillations, and avoiding local minima; higher momentum speeds convergence but risks overshooting, while lower momentum ensures stability. Weight decay (L2 regularization) prevents overfitting by penalizing large weights, promoting simpler models: higher values reduce overfitting but may underfit, while lower values allow flexibility but risk overfitting. Anchor scale adjusts predefined box sizes in object detection: larger scales improve the detection of big objects, while smaller scales enhance accuracy for small objects.
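As a concrete illustration of how these hyperparameters interact, the sketch below implements SGD with momentum and L2 weight decay, plus a step learning-rate schedule. All values are illustrative defaults, not those of any specific reviewed model.

```python
import numpy as np

def sgd_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=5e-4):
    """One SGD update with momentum and L2 weight decay: the decay term pulls
    weights toward zero, while momentum smooths successive updates."""
    grad = grad + weight_decay * w          # L2 penalty on the weights
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

def step_decay(lr0, epoch, drop=0.1, every=30):
    """Step learning-rate schedule: multiply the base rate by `drop`
    every `every` epochs."""
    return lr0 * drop ** (epoch // every)

# One illustrative update of a 2-parameter model.
w = np.array([1.0, -2.0])
v = np.zeros(2)
w, v = sgd_step(w, grad=np.array([0.5, -0.5]), velocity=v)
```

Raising `momentum` accelerates convergence at the risk of overshooting, while raising `weight_decay` shrinks the weights more aggressively, trading overfitting risk for possible underfitting, exactly the trade-offs described above.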
In the reviewed Fast R-CNN models, [22] used SGD with mini-batches of 2 images, 128 RoIs, a learning rate of 0.001 decayed after
For Faster R-CNN, [34] modified anchor scales
In Mask R-CNN, [51] set the learning rate to 0.02, weight decay to 0.0001, momentum to 0.9, batch size of 16, and trained for
For YOLO models, [121] used a learning rate of 0.001 decayed by 0.1 per epoch, momentum of 0.9, weight decay of 0.0005, 8000 max_batches, and batch size of 32. Reference [122] used a learning rate of 0.01, final rate of 0.2, momentum of 0.937, weight decay of 0.0005, 110 epochs, and batch size of 12. Reference [123] utilized a learning rate of 0.0002, Adam optimizer with
Among ViT models, [89] used
For DETR models, [94] used ResNet-50/101 backbones, a transformer with 6 encoder/decoder layers, 256 hidden dimensions, 8 attention heads, AdamW optimizer with
Table 2 compares various discriminative DL architectures. In classification, models are evaluated by overall accuracy for traffic scene recognition tasks. The Receptive Field NN [6], an early DL application for road sign classification, achieved a modest 47.7% accuracy on a custom dataset ("RFNN_TSR").
For object detection, Table 2 highlights performance improvements of Fast R-CNN over standard R-CNN, with AllLightRCNN [32] achieving 94.20% mean accuracy on AllLightRCNN_DS, a 16.4-point gain over R-CNN's 77.8% on the same dataset (the CNN baseline reached 65.76% mAP on "RCNNs_Detection"). Mask R-CNN and Faster R-CNN achieved mAPs of 74.30% and 76.30% [57]. Faster R-CNN models on COCO 2017 achieved AP scores from 40.2% to 44.0%, with Faster R-CNN-FPN-R101+ (108 epochs) [97] obtaining the highest AP of 44.0%.
For segmentation tasks, the CNN-based SNE-RoadSeg [8] achieved 98.6% accuracy on the R2D dataset, demonstrating high performance for road segmentation. Real-world applications include flood segmentation, where Mask R-CNN [52] achieved 93.0% accuracy on the IDRF dataset [58]. CapsNet-based methods, like U-Net [118], achieved competitive IoU for vehicle-related scene segmentation on the AHI dataset.
In traffic action recognition, CNN-based methods achieved modest results, with CPM [102] reaching 63.98% accuracy on the TPGR dataset. GCN-based models, such as Pose GCN [102] and ST-GCN [101], significantly outperformed CNNs with accuracies of 87.72% and 97.52%, respectively, due to their ability to capture spatial-temporal relationships and structural representations.
TABLE 2. Comparison of discriminative DL models for traffic scene understanding across applications, frameworks, datasets, metrics, and results.
| Application | Framework | Variant | Dataset | Performance Metric | Result |
|---|---|---|---|---|---|
| Classification | Vanilla CNN | Receptive Field NN [6] | RFNN_TSR | Accuracy | 47.7% |
| | Multi-scale CNN | 2LConvNet ms 108-108 [7] | TL_Dataset | Accuracy | 97.83% |
| | CNN | ResNet-50 [111] | 1043-syn | Accuracy | 90.53% |
| | GCN | MR-GCN [100] | KITTI | Accuracy | 89% |
| | GCN | HetEdgeGCN [109] | GAT_SCENE | Accuracy | 93.54% |
| | GCN | HetEdgeGatedGCN [109] | GAT_SCENE | Accuracy | 90.09% |
| | GCN | MRGCN [111] | 1043-syn | Accuracy | 95.80% |
| | GAT | HetEdgeGAT [109] | GAT_SCENE | Accuracy | 94.29% |
| | GIN | MRGIN [111] | 1043-syn | Accuracy | 87.84% |
| | CapsNet | ImprovedCaps [119] | GTSRB | Accuracy | 96% |
| | CapsNet | LiuCaps [120] | TL_Dataset | Accuracy | 98.72% |
| Object Detection | CNN | ResNet50 [57] | RCNNs_Detection | mAP | 65.76% |
| | R-CNN | VGG16 [10] | VOC2007 | mAP | 66.0% |
| | R-CNN | ZF, VGG16 [32] | AllLightRCNN_DS | Mean Accuracy | 77.8% |
| | Fast R-CNN | AllLightRCNN [32] | AllLightRCNN_DS | Mean Accuracy | 94.20% |
| | Mask R-CNN | ME Mask R-CNN [54] | TrainObstacle | mAP | 91.3% |
| | Mask R-CNN | Mask R-CNN [57] | RCNNs_Detection | mAP | 74.30% |
| | Faster R-CNN | Faster R-CNN [57] | RCNNs_Detection | mAP | 76.30% |
| | Faster R-CNN | ResNet-50 [73] | ShokriCollection_DS | AP | 54.69% |
| | Faster R-CNN | Inception v2 [84] | PSU | AP | 73.9% |
| | Faster R-CNN | Faster R-CNN-FPN-R50 (36 epochs) [97] | COCO 2017 | AP | 40.2% |
| | Faster R-CNN | Faster R-CNN-FPN-R50++ (108 epochs) [97] | COCO 2017 | AP | 42.0% |
| | Faster R-CNN | Faster R-CNN-FPN-R101 (36 epochs) [97] | COCO 2017 | AP | 42.0% |
| | Faster R-CNN | Faster R-CNN-FPN-R101+ (108 epochs) [97] | COCO 2017 | AP | 44.0% |
| | YOLOv1 | A custom CNN [63] | LISA-dayTrain | AUC | 58.3% |
| | YOLOv2 | Darknet-19 [63] | LISA-dayTrain | AUC | 60.05% |
| | YOLOv3 | Darknet-53 [63] | LISA-dayTrain | AUC | 90.49% |
| | YOLOv3 | Darknet-53 [84] | PSU | AP | 96.5% |
| | YOLOv4 | CSPDarknet53-PANet-SPP [84] | PSU | AP | 96.5% |
| | YOLOv5 | Modified CSPDarknet53 [73] | ShokriCollection_DS | AP | 93.85% |
| | YOLOv6 | EfficientRep [73] | ShokriCollection_DS | AP | 92.95% |
| | YOLOv7 | No pretrained backbone [73] | ShokriCollection_DS | AP | 98.77% |
| | YOLOv8 | A CSPDarknet variant [73] | ShokriCollection_DS | AP | 91.23% |
| | ViT | Vanilla ViT [89] | ViT_DS | F1-score | 92.10% |
| | ViT | ViT-SSA [89] | ViT_DS | F1-score | 98.07% |
| | ViT | ViT-TA [91] | DAD | F1-score | 94% |
| | DETR | DSRA-DETR [95] | CCTSDB | AP | 78.24% |
| | DETR | MTSDet [96] | CTSD | mAP | 94.3% |
| | DETR | DETR-R50 (500 epochs) [97] | COCO 2017 | AP | 42.0% |
| | DETR | DETR-DC5-R50 (500 epochs) [97] | COCO 2017 | AP | 43.3% |
| | DETR | Deformable DETR-R50, single-scale [97] | COCO 2017 | AP | 39.7% |
| | DETR | Deformable DETR-R50 (150 epochs) [97] | COCO 2017 | AP | 45.3% |
| | DETR | UP-DETR-R50 (150 epochs) [97] | COCO 2017 | AP | 40.5% |
| | DETR | UP-DETR-R50+ (300 epochs) [97] | COCO 2017 | AP | 42.8% |
| | DETR | SMCA-R50 (108 epochs) [97] | COCO 2017 | AP | 45.6% |
| | DETR | DETR-R101 (500 epochs) [97] | COCO 2017 | AP | 43.5% |
| | DETR | DETR-DC5-R101 (500 epochs) [97] | COCO 2017 | AP | 44.9% |
| | DETR | SMCA-R101 (50 epochs) [97] | COCO 2017 | AP | 44.4% |
| | DETR | DetectFormer [98] | BCTSDB | AP75 | 91.4% |
| | GAT | APSEGAT [107] | AMLPR | F-score | 90% |
| | CapsNet | TSDCaps [117] | GTSRB | Accuracy | 97.62% |
| Segmentation | CNN | SNE-RoadSeg [8] | R2D | Accuracy | 98.6% |
| | Mask R-CNN | Mask R-CNN [52] | IDRF [58] | Accuracy | 93.0% |
| | CapsNet | U-Net [118] | AHI | IoU | 74.61% |
| Action Recognition | CNN | CPM [102] | TPGR | Accuracy | 63.98% |
| | ViT | Action-ViT [90] | JAAD | F1-score | 90.2% |
| | GCN | ST-GCN [101] | CTPG | Accuracy | 87.72% |
| | GCN | Pose GCN [102] | TPGR | Accuracy | 97.52% |
| | GCN | OpenPose [103] | TPGR | Accuracy | 87.72% |
| | GCN | DA-GCN [104] | TPGR | Accuracy | 94.70% |
| Object Tracking | YOLOv3 | Deep SORT [108] | Pets-mf | MOTA | 92.08% |
| | GAT | GAM tracker [108] | Pets-mf | MOTA | 94.99% |
| Path Prediction | GIN | Pishgu [112] | ActEV/VIRAT | ADE, FDE | 14.11, 27.96 |
| | GIN | CARPe [115] | ETH | ADE, FDE | 0.80, 1.48 |
| Scene Retrieval | GIN | VF2 without optimization [113] | RSG | Matching Time (s) | 0-5,000 |
| | GIN | VF2 with optimization [113] | RSG | Matching Time (s) | 0-2.5 |
| | GIN | GNN-based Matching [113] | RSG | Matching Time (s) | 0-2.0 |
| Novel Scenario Detection | ViT | ViT-L [92] | Wurst_DS [93] | AUC | 95.6% |
| | GIN | Expert-LaSTS [114] | OpenStreetMap | AUC | 99.1% |
For object tracking, Deep SORT with YOLOv3 [108] achieved a MOTA of 92.08% on the Pets-mf dataset, while the GAM tracker [108] improved MOTA to 94.99%, demonstrating the impact of attention mechanisms. In scene retrieval, GIN-based models such as VF2 and GNN-based Matching [113] achieved complete (100%) accuracy with retrieval times of 0-2 seconds. For novel scenario detection, ViT-L [92] and Expert-LaSTS [114] achieved AUCs of 95.6% and 99.1%, respectively.
Generative machine learning models are growing increasingly integral in advancing DL for traffic scene understanding. Unlike discriminative models that differentiate between distinct entities, generative models excel in generating new data instances that mimic real-world scenarios. These models are adept at creating realistic images and simulations that can be invaluable in traffic scene analysis.
In traffic scene understanding, generative models find applications in synthesizing diverse and complex traffic scenarios for training and evaluation purposes. They can generate varied environmental conditions, and lighting variations, including rare traffic occurrences, offering a robust and comprehensive dataset for training discriminative models. This enhances the ability of DNNs to interpret and respond accurately to dynamic traffic situations. Additionally, generative models can be used in anomaly detection, where they help identify unusual or hazardous traffic conditions by contrasting them with the normative patterns they have learned.
The following sections discuss generative ML models shaping traffic scene understanding, from basic GANs to complex hybrids blending generative and discriminative techniques. These models address data generation, realism enhancement, and scenario simulation, advancing DL in intelligent transportation systems. Additionally, we explore HPO for these architectures and evaluate their performance metrics, offering a comprehensive overview.
A GAN, as introduced first in [124], consists of two NNs, a generator and a discriminator, which are trained simultaneously through adversarial training. The generator creates new data instances that resemble a given dataset, while the discriminator evaluates them for authenticity. This process leads the generator to produce increasingly realistic data samples.
Figure 11 illustrates the training process of a GAN for application to a traffic scene understanding problem. The generator (G) creates fake data samples from random noise. It can be represented as a function $G(z; \theta_g)$ that maps a noise vector $z \sim p_z(z)$ to the data space.
The discriminator (D) evaluates the authenticity of a given data sample, determining whether it is real (from the actual dataset) or fake (generated by the generator). It can be represented as a function $D(x; \theta_d)$ that outputs the probability that $x$ came from the real data distribution.
The training objective of a GAN is given by the following value function, known as the minimax loss:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))],$$

where $p_{\text{data}}$ denotes the distribution of the real data and $p_z$ the prior over the noise input.
This objective creates a dynamic similar to a tug-of-war between the generator and the discriminator. The generator aims to minimize this objective, while the discriminator seeks to maximize it.
In the training process, the generator and discriminator are updated iteratively. The generator learns to produce more realistic data to fool the discriminator, while the discriminator improves its ability to differentiate between real and fake data.
This adversarial process continues until the generated data is indistinguishable from real data, or until a stopping criterion is met.
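As a concrete illustration of this objective, the value function can be evaluated directly on discriminator outputs. The NumPy sketch below is illustrative only: the arrays stand in for a trained discriminator's predictions, and it shows that a discriminator that cleanly separates real from generated samples attains a higher $V(D, G)$ than one the generator has fooled.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Stand-in discriminator outputs (probabilities), not a trained model:
# a confident discriminator scores real samples high and fake samples low.
confident = gan_value(np.array([0.90, 0.95]), np.array([0.05, 0.10]))
# At the theoretical equilibrium, D outputs 0.5 everywhere.
fooled = gan_value(np.array([0.50, 0.50]), np.array([0.50, 0.50]))
```

The discriminator's updates push `confident`-like behavior (larger $V$), while the generator's updates push the system toward the `fooled` equilibrium, where $V = 2\log 0.5 \approx -1.386$.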
GANs are utilized for spatio-temporal traffic state reconstruction [125], enhancing video frame predictions [126] and aiding semantic segmentation [127] for autonomous vehicles, augmenting training data for rare events [128], synthesizing soiling on fisheye camera images [129], improving highway traffic images [130] and road segmentation [131] in adverse weather, and augmenting data to improve classifier generalization [132].
SoPhie [133], an innovative GAN-based framework, addresses the vital task of path prediction for interacting agents in autonomous scenarios by integrating physical and social information through a novel combination of social and physical attention mechanisms. It achieves remarkable ADE and FDE scores of 0.70 and 1.43, respectively, on the ETH dataset, setting a new standard in trajectory forecasting benchmarks for self-driving car applications.
Peak Signal to Noise Ratio (PSNR) and Structural Similarity Index (SSIM) are fundamental metrics extensively used in the field of image and video quality assessment. These metrics are crucial for quantifying the fidelity and visual quality of images and videos by comparing them to original, uncompressed, or distortion-free versions.
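For reference, both metrics can be computed in a few lines. The sketch below implements PSNR and a simplified single-window SSIM; note that the standard SSIM averages this formula over local windows, so library implementations (e.g., scikit-image) will give slightly different values.

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between a reference and a test image."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)

def ssim_global(x, y, max_val=255.0):
    """Simplified SSIM computed over the whole image as a single window."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    x, y = x.astype(np.float64), y.astype(np.float64)
    mx, my, vx, vy = x.mean(), y.mean(), x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

img = np.linspace(0, 255, 64).reshape(8, 8)  # toy 8x8 gradient image
noisy = img + 1.0  # uniform +1 intensity error (MSE = 1)
```

With an MSE of exactly 1, the PSNR of `noisy` against `img` is $10\log_{10}(255^2) \approx 48.13$ dB, while SSIM of an image against itself is 1.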

FIGURE 11. The training process of a GAN for application to a traffic scene understanding problem: the generator (G) creates synthetic traffic scenes from a random noise vector sampled from a latent distribution (e.g., Gaussian).
The TSR-GAN model proposed in [125] effectively mines and estimates traffic correlations and patterns, setting a new benchmark for spatio-temporal traffic state reconstruction. In comprehensive comparisons, TSR-GAN excels by achieving the highest traffic state similarity (TSS) score of 32.595. It also records the lowest error metrics, including an RMSE of 6.585, an MAE of 5.205, and a MAPE of 8.671%, outperforming GASM, CED, SRGAN, and their variants, and demonstrating superior accuracy and adaptability in reconstructing traffic states under diverse conditions.
The study in [126] evaluates the effectiveness of GAN-based enhancement methods, specifically SRGAN [134] and DeblurGAN [135], in refining video frame predictions made by another generative model, FutureGAN [136], to significantly improve object detection for autonomous vehicles, with the enhanced frames demonstrating a notable improvement in average precision (AP) for vehicle detection.
A modified CycleGAN [137] introduced in [128] effectively demonstrates the use of GANs for augmenting training data for rare events in autonomous systems, improving mAP for perception tasks over a 44.5% baseline.
The authors in [129] propose two algorithms for soiling synthesis on fisheye camera images. The first is a CycleGAN-based baseline [137], and the second is DirtyGAN. Both algorithms deliver comparable end-to-end results. DirtyGAN, a GAN-based approach, improves soiling detection by training on a combination of real and synthetic images.
The study in [130] introduces a highly effective highway traffic image enhancement algorithm for adverse weather conditions, achieving remarkable performance gains of 21.97% and 12.89% in nighttime enhancement, 26.16% and 12.75% in rain removal, and 26.56% and 12.1% in fog removal for PSNR and SSIM metrics respectively, showcasing its superior capability in detail retention and noise reduction.
In [127] (referred to as "MTPanClass" in our work), a model is proposed to refine the segmentation of target main bodies by leveraging the pan-class intrinsic relevance among multiple targets. This approach includes a novel use of generative adversarial learning, which integrates intrinsic relevance features with semantic features to enhance segmentation. MTPanClass achieves an mIoU score of 49.8% on ADE20K, with corresponding results reported on PASCAL-Context, KITTI, and Cityscapes.
IEC-Net, presented in [131], is an image enhancement network based on CycleGAN [137], specifically designed to improve road segmentation under diverse weather conditions. When tested on the Cityscapes dataset under severe weather scenarios, IEC-Net achieved an mIoU of 89.3%, showcasing significant improvements in segmentation accuracy when integrated with state-of-the-art segmentation models.
The AttGAN model proposed in [138] is utilized in [132] to introduce a novel data augmentation approach. This method leverages attribute-conditioned generative models to semantically modify training data, significantly enhancing the generalization capabilities of deep classifiers across varying times of day and weather conditions. Notably, this approach achieved strong F1-scores for semantic domain adaptation on the BDD dataset when training with original daytime images and synthesized nighttime images.
GANs are highly effective for generating realistic data, including high-quality images and other forms of synthetic content. However, they are notoriously difficult to train, often facing challenges such as instability and mode collapse, where the model produces limited variations of the data. Successful training of GANs requires meticulous tuning of hyperparameters and network architecture, as well as access to large datasets. These factors make GANs computationally intensive and difficult to scale, especially for complex or high-resolution tasks.
A cGAN is an extension of the standard GAN that allows for the conditional generation of data based on input labels or auxiliary information. The cGAN model comprises two neural networks, a generator $G$ and a discriminator $D$, both of which receive a conditioning variable $y$ (e.g., a class label) in addition to their usual inputs.
The generator $G$ takes a noise vector $z \sim p_z(z)$ together with the condition $y$ and produces a sample $G(z \mid y)$ intended to resemble real data associated with $y$.
The discriminator $D$ receives a sample $x$ together with the condition $y$ and outputs $D(x \mid y)$, the probability that $x$ is a real sample for that condition.
The objective functions for the generator and discriminator are derived from the original GAN framework but are conditioned on $y$.
The discriminator tries to maximize the probability of correctly classifying the real data and minimize the probability of incorrectly classifying the generated data. The loss function for the discriminator is:

$$L_D = -\mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x \mid y)] - \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y) \mid y))].$$
The generator tries to minimize the probability that the discriminator correctly distinguishes between real and generated data. The loss function for the generator is:

$$L_G = \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y) \mid y))].$$
In practice, the generator is often trained instead to maximize $\log D(G(z \mid y) \mid y)$, the non-saturating form of this loss, which provides stronger gradients early in training.
The training process alternates between updating the discriminator $D$, which maximizes its ability to classify real versus generated samples, and updating the generator $G$, which learns to fool the discriminator.
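In implementation terms, both conditional losses reduce to binary cross-entropy applied to the discriminator's conditioned outputs. The NumPy sketch below uses stand-in values for $D(x \mid y)$ and $D(G(z \mid y) \mid y)$ rather than a real conditioned network:

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy, the practical form of both cGAN losses."""
    pred = np.clip(pred, 1e-7, 1 - 1e-7)  # numerical safety for log()
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

# Stand-in discriminator outputs; in a real model, D receives the
# condition y (e.g., a class label) alongside each sample.
d_real = np.array([0.8, 0.9, 0.7])  # D(x|y) on real conditioned samples
d_fake = np.array([0.2, 0.1, 0.3])  # D(G(z|y)|y) on generated samples

# L_D: push D(x|y) toward 1 and D(G(z|y)|y) toward 0.
loss_d = bce(d_real, np.ones(3)) + bce(d_fake, np.zeros(3))
# Non-saturating L_G: push D(G(z|y)|y) toward 1.
loss_g = bce(d_fake, np.ones(3))
```

With these stand-in values the discriminator is winning, so the generator's loss is the larger of the two, which is exactly what drives the generator's next update.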
The introduction of the Two-Stream Conditional Generative Adversarial Network (TScGAN) in [139] significantly improves mIoU scores across various state-of-the-art CNN-based semantic segmentation models, with increases such as 77.2% to 79.0% for DeepLabV3 and 81.6% to 83.6% for HRNet. TScGAN enhances both segmentation accuracy and processing speed by addressing higher-order inconsistencies in semantic segmentation and effectively utilizing dual input streams to preserve high-level contextual information. These improvements are particularly evident when applied to smaller image sizes (e.g., 512 × 512) on datasets like Cityscapes.
Variational Autoencoders, as introduced in [140], mark a significant advancement in generative modeling by combining deep learning with variational inference. Their core innovation lies in the use of a latent variable model to effectively capture complex data distributions. This approach provides a robust framework for approximating these distributions. VAEs are trained using SGD, which ensures efficient optimization and training. This methodology not only enhances the model's generative capabilities but also aids in uncovering the underlying structure of the data, making VAEs highly versatile in generative modeling tasks.
Figure 12 illustrates the training process of a VAE applied to a traffic scene reconstruction problem. VAEs are based on a latent variable model:

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz,$$
where $x$ is the observed data (e.g., a traffic scene image), $z$ is a latent variable with prior $p(z)$, typically a standard Gaussian $\mathcal{N}(0, I)$, and $p_\theta(x \mid z)$ is the likelihood defined by a decoder network with parameters $\theta$.
The goal is to infer the posterior distribution $p_\theta(z \mid x)$, which is generally intractable; VAEs therefore learn an approximate posterior $q_\phi(z \mid x)$ with an encoder network.

FIGURE 12. Training of a VAE for a traffic scene reconstruction: an observed input image $x$ is encoded into an approximate posterior $q_\phi(z \mid x)$ over the latent space, from which a sample $z$ is drawn and decoded to reconstruct the scene.
The training of VAEs involves maximizing the Evidence Lower Bound (ELBO) on the marginal log-likelihood:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right),$$

where the first term rewards accurate reconstruction of $x$ and the KL term regularizes the approximate posterior toward the prior.
To enable gradient-based optimization, VAEs use the reparameterization trick, which allows the model to back-propagate through random nodes. If $q_\phi(z \mid x)$ is a diagonal Gaussian with mean $\mu_\phi(x)$ and standard deviation $\sigma_\phi(x)$, the latent sample can be rewritten as

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon,$$

where $\epsilon \sim \mathcal{N}(0, I)$ is an auxiliary noise variable and $\odot$ denotes element-wise multiplication, so the stochasticity is separated from the parameters $\phi$.
In practice, VAEs are implemented using NNs. The encoder network approximates $q_\phi(z \mid x)$, while the decoder network parameterizes the likelihood $p_\theta(x \mid z)$.
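A minimal sketch of the two ELBO ingredients follows, assuming a diagonal-Gaussian encoder; the vectors `mu` and `log_var` here are stand-ins for real encoder outputs, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(42)

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I): sampling stays differentiable
    in mu and log_var because the randomness is isolated in eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# Encoder outputs that exactly match the prior N(0, I): KL term vanishes.
mu, log_var = np.zeros(4), np.zeros(4)
z = reparameterize(mu, log_var)
```

During training, the reconstruction term $\mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)]$ would be computed by decoding `z`; the KL term above is the regularizer, and it is zero exactly when the approximate posterior equals the prior.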
VAEs are essential in traffic scene analysis, excelling in unsupervised tasks like data generation, denoising, and feature extraction. They create realistic and adversarial scenarios to enhance automated driving systems' robustness and are crucial for anomaly detection, boosting driving safety and efficiency. Applications include improving TLD [141], segmenting navigable spaces [142], detecting out-of-distribution (OOD) images in multi-label datasets [143], generating realistic traffic scenes [144], detecting adversarial driving scenes [145], and detecting traffic anomalies [146].
VATLD [141] adapts a state-of-the-art traffic light detector, providing visual analytics to assess and improve its accuracy and robustness.
NSS-VAE [142] is a dual-VAE architecture that excels in unsupervised segmentation of navigable spaces, surpassing 90% accuracy on the KITTI road benchmark. It outperforms traditional supervised methods, especially where ground truth labels are scarce. By merging deep features with GCNs to manage boundary uncertainties, NSS-VAE shows strong potential for autonomous navigation.
The approach in [143], based on a VAE framework, targets the detection of OOD images in multi-label datasets.
SceneGen [144] presents a neural autoregressive model for traffic scenes, generating new examples and evaluating existing ones without rules or heuristics, providing a flexible, scalable way to model real-world traffic complexity. It demonstrates significant realism improvements, with the lowest Negative Log-Likelihood (NLL) of 59.86 and an enhanced detection AP from 85.9% (using LayoutVAE) to 90.4% on the ATG4D dataset.
A tree-structured VAE (T-VAE) for Semantically Adversarial Generation (SAG) is proposed in [145], designed to detect adversarial driving scenes.
In [146], an attention-based VAE (A-VAE) with 2D CNN and BiLSTM layers improved upon a Recurrent VAE for anomaly detection on the UCSD dataset.
Clustering-based DA improves traffic scene understanding by clustering data to reveal shared structures, reducing feature discrepancies across weather, camera views, and sensor types. It enhances Person and Vehicle Re-ID by capturing domain-invariant features, with centroid alignment further closing domain gaps, and strengthens multi-object tracking and action recognition through refined temporal and spatial consistency. However, clustering-based DA faces challenges. It requires careful tuning to prevent misalignment in complex scenes, which can degrade performance if clusters capture noise instead of meaningful features. Additionally, the method may struggle in dynamic environments where clusters shift over time, affecting the consistency of multi-object tracking and action recognition. Managing computational demands and scalability also becomes challenging, especially in high-traffic scenarios with extensive data streams.
HPO critically improves generative ML models for traffic scene understanding. By tuning parameters like learning rate, batch size, and latent dimensions in GANs and VAEs, models produce more realistic, diverse synthetic scenarios, enhancing autonomous driving and traffic management algorithms. Optimized models handle rare cases, improve realism, stability, and generalization, prevent mode collapse, and accelerate convergence, thus reducing training time and resources. This ensures robust, efficient, and safe performance under diverse real-world conditions.
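A minimal random-search HPO loop over such hyperparameters might look as follows; the search space and the `train_and_eval` objective are hypothetical placeholders for an actual GAN/VAE training-and-evaluation run (which would return, e.g., a negated FID or a validation mIoU).

```python
import random

random.seed(0)

# Hypothetical search space over the parameters named above.
space = {
    "lr": [1e-4, 5e-4, 1e-3],
    "batch_size": [16, 32, 64],
    "latent_dim": [32, 64, 128],
}

def train_and_eval(cfg):
    """Stand-in objective (higher is better); a real one would train the
    generative model with cfg and score it on a validation metric."""
    return -abs(cfg["lr"] - 5e-4) * 1e3 - abs(cfg["latent_dim"] - 64) / 128

best_cfg, best_score = None, float("-inf")
for _ in range(12):  # small trial budget
    cfg = {k: random.choice(v) for k, v in space.items()}
    score = train_and_eval(cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score
```

Random search is a common baseline here because it parallelizes trivially; Bayesian or population-based methods can replace the sampling step without changing the loop structure.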
Reviewing GAN models, the Adam optimizer was utilized with a learning rate of 0.001 for FutureGAN in [126], with the learning rate gradually reduced over the course of training.
In VAE models, the authors of [148] trained their network using the Adam optimizer with a learning rate of 0.0001, setting the momentum terms to 0.5 and 0.999. They used a batch size of 1 and defined specific parameter values for the loss function.
For cGANs, TScGAN [139] was trained using a learning rate of 0.0005 and a batch size of 32 across 150 training epochs. These settings led to notable improvements in mIoU scores across different segmentation models, with DeepLabV3 increasing from 77.2% to 79.0% and HRNet from 81.6% to 83.6%.
A comparison of different categories of generative ML models is presented in Table 3. For the classification section, when applied to the Cityscapes dataset, Cycle-GAN [129] improved to an mIoU of 78.20%, while Dirty-GAN [129] achieved a higher mIoU of 91.71%. Meanwhile, the AttGAN [132] model achieved an F1-score of 96% for car classification when using synthetic snowy data generated by AttGAN, compared with a lower score for a classifier trained without the synthetic data.
GANs have also been applied for traffic image enhancement. For the Cityscapes dataset, multiple GAN methods were evaluated, with FutureGAN [126] achieving a PSNR of 22.38 and an SSIM of 0.61. By comparison, DeblurGAN [126] obtains a PSNR of 21.95 and an SSIM of 0.59, while SRGAN [126] had a lower PSNR of 20.49 and SSIM of 0.49. On the RainDegraded dataset, DCGAN [130] achieved a PSNR of 24.98 and an SSIM of 0.81, with ImprovedGAN [130] further improving these metrics to 25.81 and 0.84, respectively. Finally, on the FogDegraded (RESIDE) dataset, DeblurGAN [130] achieved a PSNR of 24.42 and an SSIM of 0.81, while ImprovedGAN [130] achieved 25.79 and 0.88, respectively.
The methods presented in the scene generation section of Table 3 are all based on vehicle action recognition. For the ATG4D large-scale traffic scene dataset, LayoutVAE [144] achieved an NLL of 210.80 nats (where "nats" denotes the unit of NLL under the natural logarithm). On the same dataset, SceneGen [144] achieved a significantly improved NLL of 59.86 nats. For the Semantic KITTI dataset, T-VAE [145] also reported reconstruction error (RE) results.
Domain Adaptation (DA) methods are essential for improving traffic scene understanding across diverse environments. Traditional models struggle with distribution differences between training and testing datasets, resulting in poor generalization. DA enables models trained in one domain (e.g., specific weather conditions or regions) to effectively handle new, unseen domains. Unlike earlier approaches relying on hand-crafted features-prone to biases and limited expressiveness-DL-based DA uses DNNs for automatic feature extraction, better capturing complex, high-dimensional relationships.
While there exist different ways of categorizing deep DA methods, they can broadly be divided into three classes: clustering-based, discrepancy-based, and adversarial-based approaches. Clustering-based methods aim to group target domain data points with similar features to those in the source domain, facilitating knowledge transfer through clustering techniques. Discrepancy-based methods focus on minimizing statistical distances, like Maximum Mean Discrepancy (MMD), between source and target feature distributions for better alignment. Adversarial-based methods use adversarial learning techniques to reduce the gap between source and target domains by training a model to fool a domain discriminator, making features indistinguishable.
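As a concrete example of the statistical distance minimized by discrepancy-based methods, the sketch below computes a biased estimate of squared MMD with an RBF kernel; the bandwidth `gamma` and the toy feature sets are free choices for illustration, not prescribed by the surveyed works.

```python
import numpy as np

def rbf_mmd2(xs, xt, gamma=1.0):
    """Biased estimate of MMD^2 between source and target feature sets,
    using the RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(xs, xs).mean() + k(xt, xt).mean() - 2.0 * k(xs, xt).mean()

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, (50, 3))          # source-domain features
tgt_same = rng.normal(0.0, 1.0, (50, 3))     # target drawn from same dist.
tgt_shifted = src + 3.0                      # target with a large domain gap
```

Discrepancy-based DA adds a term like `rbf_mmd2(features_src, features_tgt)` to the training loss, so the feature extractor is pushed to make the two domains indistinguishable under the kernel statistic.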
DA models are crucial for traffic scene understanding, allowing DNNs to adapt to varying lighting, weather, and geographic conditions without extensive retraining or large labeled datasets. This flexibility enables a single model to function effectively across diverse regions. By ensuring smooth knowledge transfer, DA models improve accuracy, reliability, and efficiency in traffic analysis and prediction, ultimately making transportation networks safer and more efficient.
In the following sections, we explore these overarching categories of models and techniques in depth. The mechanisms behind each adaptation strategy, their real-world applications, and their ability to address data variability and improve model generalization are examined thoroughly. Finally, we discuss hyperparameter optimization (HPO) for these strategies and compare their performance metrics to provide a comprehensive overview.
Clustering-based DA is a technique that helps a model trained on one domain, the source domain, perform well on another domain, the target domain, by using clustering techniques to identify shared structures between the two domains. The main idea is to group data points from both domains into clusters that capture common characteristics and use these clusters to guide the adaptation process.
Figure 13 depicts the application of clustering-based DA to image classification in a traffic scene. Let
Clustering-based DA works by grouping both source and target domain data into clusters (or pseudo-labels) and aligning these clusters across domains. The core hypothesis is that the shared clusters between the two domains capture common features that help the model generalize from the source to the target domain.
Let
where
To adapt between the source and target domains, we want the clusters in the source domain to align with the clusters in the target domain. This can be formalized as minimizing the difference between the source and target clusters. Specifically, we define a domain alignment loss based on aligning the corresponding cluster centroids of the source and target domains.
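A common instantiation of this alignment loss, assuming the \( K \) clusters are matched one-to-one across domains (notation illustrative), is

\[ \mathcal{L}_{\text{align}} = \sum_{k=1}^{K} \left\lVert \mu_k^{s} - \mu_k^{t} \right\rVert_2^{2}, \]

where \( \mu_k^{s} \) and \( \mu_k^{t} \) denote the centroid of cluster \( k \) in the source and target domains, respectively; specific methods may use variants such as MMD between cluster members instead of squared centroid distances.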
where
In addition to clustering the target domain data, we can also assign pseudo-labels to the target data based on the clustering. The pseudo-label for a target domain data point
TABLE 3. A comprehensive comparison of various generative ML models applied to traffic scene understanding, highlighting the differences in applications, frameworks, variance across models, datasets utilized, performance metrics, and the resulting effectiveness in their respective applications.
| Application | Framework | Variance | Dataset | Performance Metric | Result |
| --- | --- | --- | --- | --- | --- |
| Classification | GAN | CycleGAN [128] | RareEvents_DS | mAP | 45.5% |
| | GAN | AttGAN [132] | BDD | F1-score | 86% |
| Object Detection | VAE | VATLD [141] | BSTLD | AP@IoU50 | 0.49 |
| | VAE | MobileNet V1 [43] | BSTLD | AP@IoU50 | 0.48 |
| Segmentation | GAN | MTPanClass [127] | Cityscapes | mIoU | 89.3% |
| | GAN | CycleGAN [129] | Cityscapes | mIoU | 78.20% |
| | GAN | DirtyGAN [129] | Cityscapes | mIoU | 91.71% |
| | GAN | CycleGAN [131] | Cityscapes | mIoU | 71.6% |
| | GAN | IEC-Net [131] | Cityscapes | mIoU | 89.3% |
| | cGAN | DeepLabV3+TScGAN [139] | Cityscapes | mIoU | 79.0% |
| | cGAN | PSPNet+TScGAN [139] | Cityscapes | mIoU | 81.3% |
| | cGAN | HRNet+TScGAN [139] | Cityscapes | mIoU | 83.6% |
| | cGAN | HMSA+TScGAN [139] | Cityscapes | mIoU | 86.8% |
| | VAE | NSS-VAE [142] | KITTI | Accuracy | 90% |
| | VAE | \( \beta \)-VAE [143] | nuScenes | Detection Rate | 74% |
| Image Enhancement | GAN | FutureGAN [126] | Cityscapes | PSNR, SSIM | 22.38, 0.61 |
| | GAN | DeblurGAN [126] | Cityscapes | PSNR, SSIM | 21.95, 0.59 |
| | GAN | SRGAN [126] | Cityscapes | PSNR, SSIM | 20.49, 0.49 |
| | GAN | DCGAN [130] | RainDegraded | PSNR, SSIM | 24.98, 0.81 |
| | GAN | ImprovedGAN [130] | RainDegraded | PSNR, SSIM | 25.81, 0.84 |
| | GAN | DeblurGAN [130] | FogDegraded (RESIDE) | PSNR, SSIM | 24.42, 0.81 |
| | GAN | ImprovedGAN [130] | FogDegraded (RESIDE) | PSNR, SSIM | 25.79, 0.88 |
| Reconstructing Traffic States | GAN | TSRGAN [125] | NGSIM | TSS | 32.595 |
| Scene Generation | VAE | LayoutVAE [144] | ATG4D | NLL | 210.80 |
| | VAE | SceneGen [144] | ATG4D | NLL | 59.86 |
| | VAE | VAE [145] | Semantic KITTI | RE | \( {110.4} \pm {10.6} \) |
| | VAE | VAE-WR [145] | Semantic KITTI | RE | \( {105.9} \pm {24.6} \) |
| | VAE | GVAE [145] | Semantic KITTI | RE | \( {123.7} \pm {9.5} \) |
| | VAE | T-VAE [145] | Semantic KITTI | RE | \( {135.1} \pm {16.9} \) |
| | VAE | T-VAE-SAG [145] | Semantic KITTI | RE | \( {14.5} \pm {1.3} \) |
| Anomaly Detection | VAE | Recurrent VAE [146] | UCSD | AUC, EER | 90.4%, 15.8% |
| | VAE | A-VAE [146] | UCSD | AUC, EER | 91.7%, 18.2% |
| Path Prediction | GAN | Sophie [133] | ETH | ADE, FDE | 0.70, 1.43 |
where
The objective function in clustering-based DA consists of three parts. The first part is the source domain loss, which is the classification loss on the source domain, and can be a standard supervised learning loss (e.g., cross-entropy):
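For \( K \) classes, this source loss takes the familiar cross-entropy form (symbols here are illustrative)

\[ \mathcal{L}_{\text{src}} = -\frac{1}{n_s} \sum_{i=1}^{n_s} \sum_{k=1}^{K} y_{ik} \log p_k\!\left(x_i^{s}\right), \]

where \( n_s \) is the number of labeled source samples, \( y_{ik} \) is the one-hot label indicator, and \( p_k(x_i^{s}) \) is the predicted probability of class \( k \) for source sample \( x_i^{s} \).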
where
The second part is the domain alignment loss, which ensures the alignment between the cluster distributions in the source and target domains. Specifically, we minimize the distance between the cluster centroids in the source and target domains as follows:
where
The third part involves using the pseudo-labels from the target domain to compute a classification loss on the target domain:
where
The total loss function for clustering-based DA is the weighted sum of these three components:
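Written out with trade-off weights \( \lambda_1 \) and \( \lambda_2 \) (notation illustrative), the total loss is

\[ \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{src}} + \lambda_1 \mathcal{L}_{\text{align}} + \lambda_2 \mathcal{L}_{\text{pl}}, \]

where \( \mathcal{L}_{\text{src}} \), \( \mathcal{L}_{\text{align}} \), and \( \mathcal{L}_{\text{pl}} \) are the source classification, domain alignment, and pseudo-label losses described above.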
where

FIGURE 13. Clustering-based DA for image classification in a traffic scene: A CNN is used to extract features from both a labeled source dataset and an unlabeled target dataset, representing different traffic environments. These features are grouped into clusters for both domains to identify common feature patterns. In the source clusters, blue and orange dots represent distinct clusters (clusters
Clustering is performed on both the source and target domain data. The alignment between the clusters in the source and target domains is ensured by minimizing the distance between cluster centers, typically using a domain alignment loss as described above. Additionally, pseudo-labels for the target domain data allow the model to learn directly from the target domain in a semi-supervised manner. The model is trained on the source domain while regularizing with both the alignment loss and the pseudo-label loss to ensure that the model also works well on the target domain.
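The pipeline just described can be sketched in a few lines of numpy; the toy 2-D features below stand in for CNN embeddings, and all function names and values are illustrative rather than drawn from any cited method.

```python
import numpy as np

def source_centroids(X, y, num_classes):
    """Per-class feature centroids from labeled data."""
    return np.stack([X[y == k].mean(axis=0) for k in range(num_classes)])

def pseudo_label(Xt, centroids):
    """Assign each target feature to its nearest source-class centroid."""
    d = np.linalg.norm(Xt[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def alignment_loss(src_c, tgt_c):
    """Squared distance between corresponding source/target centroids."""
    return float(((src_c - tgt_c) ** 2).sum())

# Toy 2-D features: two well-separated source classes and a shifted target set.
rng = np.random.default_rng(0)
Xs = np.vstack([rng.normal([0, 0], 0.1, (20, 2)), rng.normal([5, 5], 0.1, (20, 2))])
ys = np.array([0] * 20 + [1] * 20)
shift = np.array([0.5, 0.5])            # simulated domain shift (e.g., new weather)
Xt = Xs + shift

src_c = source_centroids(Xs, ys, 2)
yt_hat = pseudo_label(Xt, src_c)        # pseudo-labels for the unlabeled target data
tgt_c = source_centroids(Xt, yt_hat, 2) # target centroids induced by pseudo-labels
loss = alignment_loss(src_c, tgt_c)     # value the adaptation step would minimize
```

In a full system, the centroids would be re-estimated and the alignment and pseudo-label losses back-propagated through the feature extractor at each training epoch.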
Clustering-based DA methods have advanced person reidentification (Person Re-ID) in smart surveillance systems [149], improved pedestrian tracking to enhance urban safety [150], optimized object detection for autonomous driving applications [151], and facilitated semantic segmentation in remote mapping for geographic information systems [152]. Moreover, these methods increase detection reliability across diverse environmental conditions [153].
Contrastive learning [154] enhances robustness against occlusion by teaching models to distinguish similar and dissimilar objects. Treating occluded and unoccluded instances as positive pairs helps learn occlusion-invariant features, reducing reliance on full visibility and enabling recognition even when objects are partially obscured.
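The idea can be made concrete with a minimal InfoNCE-style contrastive loss in numpy; the vectors below are illustrative stand-ins for features of occluded and unoccluded views, and the function name is our own.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor: pull the positive close, push negatives away.
    Similarities are cosine; a lower loss means the positive ranks higher."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])

# Occluded/unoccluded views of the same object act as the positive pair.
anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # partially occluded view, still similar
negative = np.array([0.0, 1.0])   # a different object
loss_good = info_nce(anchor, positive, [negative])
loss_bad  = info_nce(anchor, negative, [positive])  # swapped pairing is penalized
```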
In [149], Cluster-based Dual-branch Contrastive Learning (CDCL) tackles data noise and clothing color confusion in unsupervised domain adaptation (UDA) for Person Re-ID. Building on contrastive learning principles [155], CDCL uses partially grayed images and a dual-branch network, achieving competitive mAP when transferring from DukeMTMC-reID to Market1501.
A deep mutual distillation (DMD) framework for UDA Person Re-ID is introduced, drawing inspiration from the teacher-student paradigm [156]. This framework employs two parallel feature extraction branches that act as teachers for each other, enhancing pseudo-label quality. Combined with a bilateral graph representation to align identity features via visual and attribute consistency, this approach achieves 92.7% mAP from DukeMTMC-reID to Market1501.
In [151], ConfMix addresses UDA in object detection with region-level confidence-based sample mixing. By blending target regions with confident pseudo detections from source images and adding a consistency loss, it adapts the model to the target domain. Progressive pseudo-label filtering yields improved detection when adapting from KITTI to Cityscapes.
Semantic segmentation domain shifts are addressed through adversarial-based DA in FFREEDA (Federated source-Free Domain Adaptation) [152]. Leveraging unlabeled client data with a pre-trained server model, LADD (Learning Across Domains and Devices) employs adversarial self-supervision, ad-hoc regularization, and federated clustered aggregation with cluster-specific classifiers, achieving strong results when adapting from GTA5 to Mapillary.
CFFA, a coarse-to-fine feature adaptation approach for cross-domain object detection, is proposed in [153]. It uses multi-layer adversarial learning for marginal alignment and global prototype matching for conditional alignment. Reported results include gains on the Cityscapes to Foggy Cityscapes benchmark.
Clustering-based DA aids traffic scene understanding by grouping data into clusters that reveal shared structures, reducing feature differences across varying conditions (weather, camera views, sensors). It improves Person and Vehicle Re-ID by capturing domain-invariant features (body shape, vehicle silhouette). Techniques like centroid alignment and cluster-wise feature matching minimize domain gaps. Clustering-based DA enhances multi-object tracking (refining temporal and spatial consistency) and strengthens action recognition (leveraging contextual relations). It improves cross-domain representation, reduces retraining, and enables scalable performance for autonomous driving and traffic monitoring.
Discrepancy-based DA aims to minimize the difference between source and target domain distributions to transfer knowledge from a labeled source domain to a target domain with limited or no labeled data. The key challenge in this approach is addressing the distribution shift between the probability distributions of the source and target domains. By reducing the discrepancy between these feature distributions, the model trained on the source domain can generalize effectively to the target domain, ensuring better performance despite domain differences.
Figure 14 illustrates the application of discrepancy-based DA to object detection in a traffic scene under different conditions. Let
The first step is to learn the model on the source domain by minimizing a classification loss
where
Next, the discrepancy between the source and target distributions in the feature space is minimized using a discrepancy distance metric. Common choices include Wasserstein distance, KL divergence, and MMD. Each metric has unique strengths and application scenarios, and understanding their differences is critical to selecting the appropriate tool for DA tasks.
The Wasserstein distance, also known as the Earth Mover's Distance, dates back to 1781 and was later formalized in a modern optimization framework by [82]. It measures the minimum cost of transporting one probability distribution to match another:
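In its standard primal form, the distance between source distribution \( P_s \) and target distribution \( P_t \) can be written as

\[ W(P_s, P_t) = \inf_{\gamma \in \Pi(P_s, P_t)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right], \]

where \( \Pi(P_s, P_t) \) is the set of all joint distributions (transport plans) whose marginals are \( P_s \) and \( P_t \).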
where
However, the computational cost of calculating the Wasserstein distance is often higher than that of other metrics, as it involves solving a linear programming problem. This restricts its use to smaller datasets and makes it less suitable where computational efficiency is paramount.
The KL divergence, first introduced in [157], measures the relative entropy between the source and target distributions:
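For densities \( p_s \) and \( p_t \), the standard form is

\[ D_{\mathrm{KL}}(P_s \,\|\, P_t) = \int p_s(x) \log \frac{p_s(x)}{p_t(x)} \, dx, \]

which is asymmetric and can grow without bound when \( p_t \) assigns near-zero mass in regions where \( p_s \) does not.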
where
Moreover, KL divergence tends to be more sensitive to outliers compared to the Wasserstein distance and, as will be discussed, the MMD. This sensitivity arises because KL divergence heavily penalizes regions where there is a discrepancy in probability mass, which can lead to overly aggressive adaptations, especially when the target distribution contains sparse or noisy data.

FIGURE 14. Discrepancy-based DA for object detection in a traffic scene under different conditions: The process starts with a source dataset (representing familiar conditions like clear weather) and a target dataset (representing different conditions like snowy weather). Both datasets are processed through a DNN, which extracts relevant features from each dataset, referred to as “DNN Features.” These DNN features from the source and target datasets are then compared using a discrepancy loss module, which measures and minimizes the differences between the feature sets. This helps the model align the feature representations from both domains, improving its ability to detect objects even in the unfamiliar target domain. By reducing the discrepancy, the model can leverage what it learned from the source data to adapt effectively to the target conditions. The outputs on the right show detected objects in both the source and target datasets, illustrating how the model successfully performs object detection across different environments by minimizing discrepancies in feature representation. This enables more consistent detection results regardless of varying traffic scene conditions.
The MMD, introduced by [158], measures the distance between the means of two distributions in a Reproducing Kernel Hilbert Space (RKHS):
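With a feature map \( \phi \) into an RKHS \( \mathcal{H} \), the squared MMD can be written as

\[ \mathrm{MMD}^2(P_s, P_t) = \left\lVert \mathbb{E}_{x \sim P_s}[\phi(x)] - \mathbb{E}_{y \sim P_t}[\phi(y)] \right\rVert_{\mathcal{H}}^{2}, \]

which in practice is estimated from kernel evaluations \( k(x, y) = \langle \phi(x), \phi(y) \rangle \) without ever computing \( \phi \) explicitly.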
where
The MMD is advantageous in that it does not require explicit density estimation of either distribution, making it computationally efficient and straightforward to implement with kernel methods. Unlike KL divergence, it can handle distributions with disjoint support and is less sensitive to outliers, which provides more stability during training.
MMD's effectiveness depends on the chosen kernel, which affects how accurately it measures source-target discrepancies. A poorly selected kernel can yield suboptimal adaptation if it fails to capture complex distributional relationships. Compared to Wasserstein distance, MMD typically runs faster but is less interpretable in terms of physical distance.
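A kernel-based MMD estimate is short to implement. The numpy sketch below uses an RBF kernel with a fixed bandwidth (all parameter values illustrative) to compare a "source" sample against near and shifted "target" samples.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian RBF kernel matrix: k(x, y) = exp(-gamma * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between samples X (source) and Y (target)."""
    return (rbf_kernel(X, X, gamma).mean()
            - 2 * rbf_kernel(X, Y, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean())

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, (200, 2))       # e.g., clear-weather features
tgt_near = rng.normal(0.0, 1.0, (200, 2))  # same underlying distribution
tgt_far = rng.normal(3.0, 1.0, (200, 2))   # shifted distribution (domain gap)
```

The bandwidth `gamma` plays the role of the kernel choice discussed above: too large or too small a value flattens the kernel and can mask real source-target discrepancies.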
Metric selection depends on domain adaptation specifics, such as disjoint support, computational constraints, and noise sensitivity. For high-dimensional generative tasks, Wasserstein distance may offer greater stability, while tasks with overlapping distributions might benefit from the faster convergence of KL divergence or MMD.
The optimization objective, denoted as
Discrepancy-based DA has significantly advanced several real-world computer vision applications. These applications include autonomous driving systems [159], urban safety monitoring [160], traffic surveillance [161], smart surveillance for Person Re-ID [162], digital recognition in smart city systems [163], and efficient resource management in autonomous systems [164].
Drawing on the teacher-student paradigm [156], [159] introduces Masked Retraining (MRT) for domain-adaptive object detection. Using a custom masked autoencoder and selective retraining, MRT achieves strong gains when adapting from Cityscapes to Foggy Cityscapes.
Building on DETR [94], [160] proposes a robust baseline for DETR-style detectors under domain shift. Incorporating Object-Aware Alignment (OAA) and Optimal Transport Alignment (OTA), it mitigates shifts in both backbone and decoder outputs, raising mAP to 46.8% for Cityscapes to Foggy Cityscapes adaptation.
ML-ANet (Multi-Label Adaptation Network) [161] reduces source-target domain discrepancy using multiple kernel variants with MMD. Task-specific hidden layers are embedded in an RKHS to align feature distributions across domains, improving efficiency. ML-ANet achieves a mean accuracy of 94.83% on the Cityscapes to Foggy Cityscapes benchmark.
D-MMD (Dissimilarity-based MMD) loss [162] addresses the challenges of UDA in Person Re-ID by aligning pairwise dissimilarities between source and target domains rather than feature representations. This approach achieves an mAP of 48.8% on the DukeMTMC to Market1501 benchmark, without requiring data augmentation or complex network designs.
In [163], the sliced Wasserstein discrepancy (SWD) is introduced for UDA, combining task-specific decision boundary alignment with the Wasserstein distance. Validations include digit and traffic-sign recognition, such as SYNSIG to GTSRB adaptation.
Selective adaptation for object detection (TDOD), leveraging domain gap metrics such as MMD, DSS, and SWD, is proposed in [164] to perform adaptation only when necessary. This approach minimizes cost while maintaining accuracy. On the DGTA benchmark, a no-adaptation model already performs well in clear-day to overcast scenarios.
Discrepancy-based DA aligns features across domains using metrics like MMD and Wasserstein distance, handling variations in layout, color, and environment. It bolsters robustness in Re-ID, tracking, and action recognition, aiding cross-camera tracking, behavior analysis, and intelligent traffic systems. However, these methods may struggle with complex shifts not fully captured by distance metrics and can be computationally intensive, requiring extensive tuning and resources. This complexity may limit their scalability in large, real-time systems.
Adversarial-based DA is a technique for adapting a model trained on one domain, called the source domain, to achieve strong performance on a different domain, known as the target domain. This approach employs adversarial learning to reduce discrepancies between the two domains, allowing the model to generalize effectively. The core idea is to train a feature extractor that makes the data representations from both the source and target domains indistinguishable to a domain discriminator, while also ensuring the model performs well on the original source domain task.
Figure 15 illustrates the use of adversarial-based domain adaptation for segmenting a traffic scene. Let

FIGURE 15. Application of Adversarial-based DA to segmentation of a traffic scene: The figure shows a labeled source domain dataset and an unlabeled target domain dataset, where the source and target domains are captured under different weather conditions. The goal of this method is to achieve effective segmentation in the target domain, despite differences in data distribution between the source and target domains and variations in weather. The adversarial-based DA setup involves a shared-weight feature extractor and a domain adversarial training mechanism to align the feature spaces of both domains. The feature extractor is designed to map both source and target domain data into a shared feature space, minimizing domain-specific distinctions, including those caused by different weather conditions. The classifier predicts segmentation labels for the source domain, while the domain discriminator attempts to distinguish between source and target domain features. During training, the feature extractor is optimized to fool the domain discriminator, leading to more domain-invariant feature representations. Segmentation accuracy is further enhanced by utilizing insights from the source domain's class size distribution, which helps to regulate the constrained mutual information loss in the target domain. The combination of classification and adversarial feature losses is optimized to ensure that the segmentation model generalizes well to the target domain. Ultimately, this process results in accurate segmentation of the traffic scene, regardless of weather conditions.
Adversarial-based DA typically involves a feature extractor
The objective for the classifier and feature extractor on the source domain is to minimize the classification loss
where
The domain discriminator
where
To ensure
The overall loss combines the source classification loss and the adversarial domain confusion loss:
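In a DANN-style formulation (notation illustrative), the combined objective can be written as

\[ \min_{E,\,C} \; \mathcal{L}_{\text{cls}}\!\left(C(E(x^{s})), y^{s}\right) - \lambda \, \mathcal{L}_{\text{dom}}\!\left(D(E(x)), d\right), \qquad \min_{D} \; \mathcal{L}_{\text{dom}}\!\left(D(E(x)), d\right), \]

where \( E \), \( C \), and \( D \) denote the feature extractor, classifier, and domain discriminator, \( d \) is the binary domain label, and \( \lambda \) balances the adversarial term; in practice, the sign flip on \( \mathcal{L}_{\text{dom}} \) for the feature extractor is often implemented with a gradient reversal layer.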
where
Training alternates between two steps: training
Image-to-image translation (I2IT), introduced in [165], converts an image from one domain to another, preserving core structure while adapting style or characteristics. It is used in photo enhancement, style transfer, and data augmentation. Though not inherently a DA task, I2IT methods are widely applied to DA ([166], [167]), often using GAN-based frameworks [124]. CycleGAN [137] is a key milestone in I2IT and underpins many adversarial DA approaches in Table 4.
CycleGAN introduces a cycle consistency loss to ensure that the original image can be recovered after a round-trip translation
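For generators \( G: X \to Y \) and \( F: Y \to X \), the cycle consistency loss takes the form

\[ \mathcal{L}_{\text{cyc}}(G, F) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\!\left[\lVert F(G(x)) - x \rVert_1\right] + \mathbb{E}_{y \sim p_{\text{data}}(y)}\!\left[\lVert G(F(y)) - y \rVert_1\right], \]

penalizing any change that cannot be undone by translating back to the original domain.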
where
Adversarial-based DA has been applied in traffic scene understanding for tasks like day-to-night translation [168], haze synthesis and removal [148], semantic segmentation [169], object detection [170], [171], [172], [173], scene classification [174], [175], scene segmentation [176], and cross-domain adaptation in challenging environments [177], [178], [179], [180], [181], [182]. Moreover, these methods contribute to fair scene adaptation in urban monitoring [183].
The Fréchet Inception Distance (FID) score measures similarity between real and generated images by calculating the Fréchet (Wasserstein-2) distance between their multivariate Gaussian distributions, based on means and covariances. It uses features from an intermediate layer of the Inception network [184], capturing both visual quality and diversity.
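Between two Gaussians the Fréchet distance has a closed form. The numpy sketch below implements the diagonal-covariance simplification (full FID uses dense feature covariances and a matrix square root), with toy features standing in for Inception activations; all names and values are illustrative.

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Fréchet (Wasserstein-2) distance between two Gaussians with diagonal
    covariances: ||mu1 - mu2||^2 + sum((sqrt(var1) - sqrt(var2))^2)."""
    mu_term = ((mu1 - mu2) ** 2).sum()
    cov_term = ((np.sqrt(var1) - np.sqrt(var2)) ** 2).sum()
    return float(mu_term + cov_term)

def stats(X):
    """Per-dimension mean and variance of a feature sample."""
    return X.mean(axis=0), X.var(axis=0)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, (1000, 4))        # stand-in for real-image features
fake_close = rng.normal(0.1, 1.0, (1000, 4))  # generated images, close match
fake_far = rng.normal(2.0, 3.0, (1000, 4))    # generated images, poor match

fid_close = frechet_distance_diag(*stats(real), *stats(fake_close))
fid_far = frechet_distance_diag(*stats(real), *stats(fake_far))
```

A lower score indicates generated features that better match the real distribution in both mean and spread, which is why FID captures quality and diversity jointly.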
The adversarial method in [168] (referred to as "Day-to-Night" in our work) uses CycleGAN [137] for day-to-night translation, enhanced by transfer learning from semantic segmentation. Trained on BDD segmentation data and adapted to Tokyo, it handles unique lighting conditions and improves object detection performance.
ParaTeacher [179] introduces a UDA approach combining a Style-Content Discriminated Data Recombination (SCD-DR) module for data refinement with an Iterative Cross-Domain Knowledge Transferring (ICD-KT) module for knowledge enhancement. Integrated with Faster R-CNN, it yields consistent mAP gains.
PanopticGAN [180] proposes a GAN framework for panoptic-aware I2IT, improving image quality and object recognition with a feature masking module and a compact thermal dataset. It enhances boundary sharpness and segmentation, achieving superior fidelity and an FID score of 69.4, outperforming existing methods.
The CyCADA model [181] combines discriminative training with cycle-consistent adversarial DA at the pixel and feature levels, requiring no aligned image pairs. It proves effective in semantic segmentation, achieving strong results on GTA5 to Cityscapes adaptation.
The model in [174] (referred to as "UDAofUrbanScenes" in our work) combines supervised learning on synthetic data, adversarial learning between labeled synthetic and unlabeled real data, and self-teaching guided by segmentation confidence. It adapts urban scene segmentation from synthetic datasets (GTA5, SYNTHIA) to real-world datasets (Cityscapes), improving performance on rare classes and achieving 30.2% mIoU.
CPGAN [172] enhances vehicle detection in foggy conditions with an improved CycleGAN [137] for style transfer and a pre-trained YOLOv4, achieving improved detection on the HVFD (normal-to-foggy) benchmark.
The I2IT model [182] (referred to as "SGND" in our work) introduces a multi-task unsupervised NN for day-to-night translation using adversarial training. It combines semantic segmentation and geometric depth as spatial attention maps on the BDD dataset. Featuring a generator for conversion and a discriminator for realism, SGND achieves an FID score of 31.245, superior KID metrics, and improved realism, accuracy, and domain mapping.

FIGURE 16. Training procedure of adversarial feature adaptation for traffic scene classification. The process involves three key steps: In Step 0, the source-specific feature extractor (E_S) and classifier (C) are trained on source images to minimize classification error via cross-entropy (CE) loss. In Step 1, the feature generator (S) and discriminator (D_1) are trained to produce domain-specific generated features, using noise conditioning and source labels to enhance the adaptation capability. In Step 2, the shared encoder (E_T) and discriminator (D_2) are trained adversarially to align target-domain features with the generated source features.
An innovative unsupervised I2IT framework is introduced in [148] that leverages both VAE and GAN, along with an MMD-based VAE that uses MMD as a discrepancy measure to align latent distributions effectively. This discrepancy-based alignment allows the framework to handle both haze image synthesis and haze removal in a unified manner, demonstrating promising results on the Apollo dataset, with PSNR and SSIM metrics of 27.3772 and 0.9271, respectively.
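To make the MMD-based alignment concrete, the following is a minimal sketch of the squared Maximum Mean Discrepancy with an RBF kernel, the discrepancy measure used to compare latent distributions; the toy feature batches and the `sigma` bandwidth are illustrative assumptions, not values from [148].

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # Pairwise RBF kernel matrix between rows of a and b.
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between feature batches
    x (n, d) and y (m, d); zero iff the kernel mean embeddings match."""
    k_xx = rbf_kernel(x, x, sigma).mean()
    k_yy = rbf_kernel(y, y, sigma).mean()
    k_xy = rbf_kernel(x, y, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy

src = np.array([[0.0, 0.0], [0.1, 0.1]])   # toy "synthetic-domain" latents
tgt = np.array([[1.0, 1.0], [0.9, 1.1]])   # toy "real-domain" latents
print(mmd2(src, src))  # ~0: identical distributions
print(mmd2(src, tgt))  # larger: the domain gap the VAE is trained to shrink
```

In the framework of [148], a term of this form is added to the VAE objective so that the encoder pulls the two latent distributions together.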
In [169], a novel teacher-student [156] approach for unsupervised domain-adaptive semantic segmentation in memory-constrained models (referred to as DRN-D-BasedDA in our work) is presented. The method employs a multi-level strategy with adversarial learning and uses a custom cross-entropy loss with pseudo-teacher labels to address domain gaps and memory constraints. DRN-D-BasedDA improves adaptability in both real and synthetic scenarios, achieving an mIoU of 37.35% on GTA5 to Cityscapes.
Adversarial Feature Adaptation (AFA), first introduced in [187], is a UDA technique that enhances model robustness and generalization by augmenting training data with adversarially generated features. It employs a domain-invariant feature extractor trained via feature-space data augmentation, utilizing GANs to broaden the input feature distribution. This method aims to improve generalization to unseen data, which is especially useful in scenarios requiring resilience to adversarial examples or when training data is scarce or non-representative. As with I2IT adversarial models, it has been applied to overcome domain shift in various traffic vision tasks, including object detection [170], [171], traffic scene classification [175], and traffic scene segmentation [176]. Figure 16 shows the training procedure of AFA applied to a traffic scene classification problem.
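The adversarial alignment step at the heart of such procedures can be sketched with the two competing objectives below: the discriminator learns to separate source from target features, while the encoder is updated with flipped domain labels so its features fool the discriminator. The toy features, linear discriminator, and function names are hypothetical stand-ins, not the architecture of [187].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, label):
    # Binary cross-entropy for a batch of discriminator outputs.
    eps = 1e-8
    return -np.mean(label * np.log(p + eps) + (1 - label) * np.log(1 - p + eps))

# Toy per-image feature vectors (hypothetical dimensions).
f_src = np.array([[1.0, 0.5], [0.8, 0.7]])   # from a source encoder
f_tgt = np.array([[0.2, -0.1], [0.1, 0.0]])  # from the shared/target encoder
w = np.array([1.0, -1.0])                    # linear "discriminator" weights

d_src = sigmoid(f_src @ w)  # D should output 1 for source-like features
d_tgt = sigmoid(f_tgt @ w)  # ...and 0 for target features

# Discriminator objective: tell the two domains apart.
loss_d = bce(d_src, 1.0) + bce(d_tgt, 0.0)

# Encoder (adversarial) objective: flipped labels, so the encoder learns
# target features the discriminator mistakes for source-domain ones.
loss_e = bce(d_tgt, 1.0)
print(loss_d, loss_e)
```

Alternating updates on `loss_d` and `loss_e` drive the two feature distributions together, which is the domain-invariance property AFA exploits.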
AFAN [170] merges an advanced UDA framework for object detection with an intermediate-domain image generator and domain-adversarial training with soft domain labels, significantly enhancing feature alignment through feature-pyramid and region-feature alignment techniques. This comprehensive approach fosters domain-invariant feature learning and achieves a notable mAP of 41.4% on the CityScapes-to-KITTI benchmark.
SADA (Sparse Adversarial Domain Adaptation) [175] tackles weather-related domain shift in traffic scene classification. Achieving 93.20% accuracy (Sunny to Cloudy/Rainy/Snowy) on the HSD dataset, it employs a unique sparse adversarial deep NN. The model captures sparse data from source scenes and aligns it with target images, extracting domain-invariant features for accurate classification. SADA outperforms existing methods, showcasing the power of sparse data and adversarial domain alignment in deep networks.
The study in [176] introduces an innovative UDA model employing a sparse adversarial multi-target approach to address domain shifts in real-world traffic scenes. Achieving a segmentation accuracy of 76.13 IoU on the ACDC dataset, it outperforms state-of-the-art methods, demonstrating the effectiveness of sparse representation compared to deep dense alternatives under diverse environmental conditions.
The approach proposed in [171] for handling foggy and rainy conditions combines image- and object-level adaptations with an adversarial gradient reversal layer to mine challenging examples. Additionally, it employs an auxiliary domain via data augmentation to introduce new domain-level metric regularization. This method achieves a detection mAP of 45.0% when adapting from CityScapes to Rainy CityScapes.
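The gradient reversal layer mentioned above is simple enough to sketch directly: it is the identity in the forward pass and multiplies incoming gradients by a negative factor in the backward pass, so the feature extractor is pushed to maximize the domain classifier's loss. This is a manual-backprop illustration of the generic mechanism, not code from [171].

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales incoming gradients by -lam in
    the backward pass, turning a minimization into adversarial training."""

    def __init__(self, lam=1.0):
        self.lam = lam  # trade-off between task loss and domain confusion

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # reversed, scaled gradient

grl = GradientReversal(lam=0.5)
x = np.array([1.0, 2.0])
assert np.allclose(grl.forward(x), x)
print(grl.backward(np.array([0.4, -0.2])))  # [-0.2, 0.1]
```

In an autograd framework the same behavior is implemented as a custom backward function; `lam` is itself a hyperparameter commonly ramped up during training.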
FREDOM [183] addresses fairness in DA for semantic scene understanding, leveraging transformer networks [188] to model conditional structures and balance class distributions. By utilizing a self-supervised loss with pseudo labels and introducing a conditional structural constraint, it achieves an mIoU of 73.6% when adapting from GTA5 to CityScapes.
The Self-Adversarial Disentangling (SAD) framework, proposed in [178], addresses the challenge of adapting to specific domain shifts in DA by introducing the concept of Specific DA (SDA) and mitigating intra-domain gaps through a domainness creator and a self-adversarial regularizer, achieving an mAP of 45.2% on the Cityscapes-to-Foggy-Cityscapes benchmark.
The authors of [177] proposed Category-induced Coarse-to-Fine DA (C2FDA) to address the challenges of adapting object detection models to unseen and complex traffic environments. They introduced three key components: Attention-induced Coarse-Grained Alignment (ACGA), Attention-induced Feature Selection, and Category-induced Fine-Grained Alignment (CFGA). Their approach achieved 48.9% AP on synthetic-to-real adaptation (SIM10K to Cityscapes) and 40.5% mAP on weather adaptation (Cityscapes to Foggy Cityscapes).
The DAAF (Domain Adaptation of Anchor-Free) object detection method [173] tackles the challenges of cross-domain object detection in complex urban traffic scenarios. It utilizes fully convolutional adversarial training for global feature alignment and incorporates Pixel-Level Adaptation (PLA) for local feature alignment. This approach achieves an AP50 of 53.4% on SIM10K to Cityscapes.
Adversarial-based DA enhances traffic tasks like reidentification, tracking, and action recognition by using adversarial training to create domain-invariant features, minimizing domain-specific biases and enabling robust performance across varying conditions. This supports applications such as traffic flow optimization, anomaly detection, and cross-camera tracking, vital for autonomous driving and intelligent traffic systems. However, in traffic scene understanding, adversarial-based DA requires careful tuning to balance adversarial loss with task-specific accuracy, as misalignment can lead to incorrect identification or tracking. It may also struggle in rapidly changing traffic environments, where maintaining consistent feature alignment is challenging, impacting the reliability of tracking and action recognition, especially in dense, dynamic traffic scenarios.
HPO enhances the performance of clustering-based, discrepancy-based, and adversarial DA models, especially for tasks like Person Re-ID, object detection, and semantic segmentation. By fine-tuning hyperparameters such as learning rates, loss coefficients, and architectural choices, HPO optimizes model components for effectiveness across domains.
In clustering-based DA methods, optimizing network architectures, loss weights, learning rates, and data augmentation strategies improves pseudo-label reliability and domain alignment. For instance, in CDCL [149], HPO balances learning rates, weight decay, and the contrastive-loss temperature (a hyperparameter that scales the similarity measure in the contrastive loss, fine-tuning feature differentiation). This results in improved feature extraction and an mAP of 81.5% on DukeMTMC-ReID to Market1501. DMD [150] benefits from tuning graph learning rates, distillation weights (which control the influence of knowledge transfer from a teacher model to a student model), and batch size, achieving an mAP of 92.7%. In object detection, ConfMix [151] adjusts parameters such as the pseudo-label confidence threshold (0.7 to 0.9), which sets the acceptance criterion for pseudo labels to enhance reliability, and NMS thresholds (0.3 to 0.5), improving detection when adapting from KITTI to Cityscapes.
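The two detection hyperparameters just mentioned are easy to illustrate: a confidence threshold decides which pseudo labels are trusted, and the NMS IoU threshold decides how aggressively overlapping boxes are suppressed. The sketch below is a generic, hypothetical illustration of both knobs, not the ConfMix [151] implementation.

```python
import numpy as np

def filter_pseudo_labels(scores, tau=0.8):
    """Keep indices of detections whose confidence exceeds tau
    (thresholds in roughly the 0.7-0.9 range are typical)."""
    return np.where(scores >= tau)[0]

def iou(a, b):
    # Boxes as [x1, y1, x2, y2].
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression; iou_thr is the tunable threshold."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order):
        i = order[0]
        keep.append(int(i))
        order = order[1:][[iou(boxes[i], boxes[j]) < iou_thr for j in order[1:]]]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30.0]])
scores = np.array([0.95, 0.90, 0.75])
print(filter_pseudo_labels(scores, tau=0.8))  # boxes 0 and 1 survive
print(nms(boxes, scores, iou_thr=0.5))        # overlapping box 1 suppressed
```

Raising `tau` trades pseudo-label coverage for reliability, while lowering `iou_thr` suppresses more near-duplicate boxes; HPO searches over exactly these ranges.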
In discrepancy-based DA models, HPO optimizes domain alignment metrics such as FID and MMD. In ParaTeacher [179], HPO fine-tunes modules by adjusting alignment coefficients (0.1 to 0.5) and contrastive temperatures (0.07 to 0.1), improving mAP on Virtual KITTI to KITTI adaptation.
In adversarial-based DA models, HPO plays a crucial role in refining parameters for adversarial losses, feature alignment, and I2IT techniques. In [148], a combined I2IT framework utilizes a VAE-GAN structure with an MMD-based VAE. Here, MMD serves as an effective discrepancy measure to align latent distributions, while HPO is used to balance reconstruction loss (ranging from 0.2 to 0.5 ) and adversarial weights. This approach achieved PSNR and SSIM scores of 27.3772 and 0.9271, respectively, on the Apollo dataset.
TABLE 4. A comprehensive comparison of various domain adaptive ML models applied to traffic scene understanding, highlighting the differences in applications, categories, variance across models, datasets utilized, performance metrics, and the resulting effectiveness in their respective applications.
| Application | Category | Variance | Dataset | Performance Metric | Result |
| --- | --- | --- | --- | --- | --- |
| Classification | Discrepancy | DAN [161] | Cityscapes → Foggy Cityscapes | Mean Accuracy | 91.85% |
| | Discrepancy | ML-ANet [161] | Cityscapes → Foggy Cityscapes | Mean Accuracy | 94.83% |
| | Discrepancy | MCD [163] | VisDA 2017 → MSCOCO | Mean Accuracy | 71.9% |
| | Discrepancy | SWD [163] | VisDA 2017 → MSCOCO | Mean Accuracy | 76.4% |
| | Discrepancy | D-MMD [175] | HSD (Sunny → Cloudy/Rainy/Snowy) | Mean Accuracy | 72.63% |
| | Adversarial | STAR [175] | HSD (Sunny → Cloudy/Rainy/Snowy) | Mean Accuracy | 81.25% |
| | Adversarial | DWL [175] | HSD (Sunny → Cloudy/Rainy/Snowy) | Mean Accuracy | 82.38% |
| | Adversarial | SADA [175] | HSD (Sunny → Cloudy/Rainy/Snowy) | Mean Accuracy | 93.20% |
| Object Detection | Clustering | CFFA [153] | Cityscapes → Foggy Cityscapes | mAP | 38.6% |
| | Discrepancy | DefDETR [159] | Cityscapes → Foggy Cityscapes | mAP | 28.5% |
| | Adversarial | MTTrans [159] | Cityscapes → Foggy Cityscapes | mAP | 43.4% |
| | Discrepancy | MRT [159] | Cityscapes → Foggy Cityscapes | mAP | 51.2% |
| | Discrepancy | Deformable DETR [160] | Cityscapes → Foggy Cityscapes | mAP | 28.6% |
| | Adversarial | SFA [160] | Cityscapes → Foggy Cityscapes | mAP | 41.3% |
| | Adversarial | O²Net [160] | Cityscapes → Foggy Cityscapes | mAP | 46.8% |
| | Discrepancy | TDOD (Without Adaptation) [164] | DGTA (Clear → Overcast) | AP50 | 90.3% |
| | Discrepancy | TDOD (With Adaptation) [164] | DGTA (Clear → Overcast) | AP50 | 93.1% |
| | Adversarial | DaytoNight, No Augmentation [168] | BDD (Day → Night) | mAP | 55.3% |
| | Adversarial | DaytoNight, With Augmentation [168] | BDD (Day → Night) | mAP | 57.2% |
| | Adversarial | AFAN [170] | CityScapes → KITTI | mAP | 41.4% |
| | Adversarial | FogAndRainDA [171] | CityScapes → Rainy CityScapes | mAP | 45.0% |
| | Adversarial | YOLOv4+CycleGAN [172] | HVFD (Normal → Foggy) | mAP50 | 67.21% |
| | Adversarial | YOLOv4+CPGAN [172] | HVFD (Normal → Foggy) | mAP50 | 69.24% |
| | Adversarial | MGA [173] | SIM10K → Cityscapes | AP50 | 49.8% |
| | Adversarial | DAAF [173] | SIM10K → Cityscapes | AP50 | 53.4% |
| | Adversarial | C2FDA [177] | Cityscapes → Foggy Cityscapes | mAP | 40.5% |
| | Adversarial | SAD [178] | Cityscapes → Foggy Cityscapes | mAP | 45.2% |
| | Discrepancy | MTOR [179] | Virtual KITTI → KITTI | mAP | 32.75% |
| | Adversarial | ParaTeacher [179] | Virtual KITTI → KITTI | mAP | 44.59% |
| Segmentation | Clustering | FFREEDA [150] | GTA5 → Mapillary | mIoU | 40.16 ± 1.02 |
| | Discrepancy | SWD [163] | GTA5 → Cityscapes | mIoU | 44.5% |
| | Adversarial | DaytoNight, No Augmentation [168] | BDD (Day → Night) | mIoU | 59.5% |
| | Adversarial | DaytoNight, With Augmentation [168] | BDD (Day → Night) | mIoU | 61.6% |
| | Adversarial | AdaptSegNet [169] | GTA5 → Cityscapes | mIoU | 32.49% |
| | Adversarial | DRN-D-BasedDA [169] | GTA5 → Cityscapes | mIoU | 37.35% |
| | Adversarial | UDAofUrbanScenes [174] | GTA5 → CityScapes | mIoU | 30.2% |
| | Adversarial | MTKT [176] | ACDC (Sunny → Cloudy/Rainy/Snowy) | IoU | 71.01% |
| | Adversarial | LSA-UDA [176] | ACDC (Sunny → Cloudy/Rainy/Snowy) | IoU | 76.13% |
| | Adversarial | CyCADA feature-only [181] | SYNTHIA → CityScapes | mIoU | 31.7% |
| | Adversarial | CyCADA pixel-only [181] | SYNTHIA → CityScapes | mIoU | 37.0% |
| | Adversarial | CyCADA pixel+feature [181] | SYNTHIA → CityScapes | mIoU | 39.5% |
| | Adversarial | FREDOM [183] | GTA5 → CityScapes | mIoU | 73.6% |
| I2IT | Adversarial | UNIT [148] | Apollo (Haze → Dehaze) | PSNR, SSIM | 24.52, 0.85 |
| | Adversarial | CycleGAN [148] | Apollo (Haze → Dehaze) | PSNR, SSIM | 25.19, 0.89 |
| | Adversarial | VAE-GAN [148] | Apollo (Haze → Dehaze) | PSNR, SSIM | 27.38, 0.93 |
| | Adversarial | AugGAN [168] | BDD (Day → Night) | FID | 67.07 |
| | Adversarial | SemGAN [168] | BDD (Day → Night) | FID | 39.91 |
| | Adversarial | DaytoNight [168] | BDD (Day → Night) | FID | 39.26 |
| | Adversarial | CycleGAN [168] | BDD (Day → Night) | FID | 35.28 |
| | Adversarial | MUNIT+Seg [180] | Augmented KAIST-MSBDD (Day → Night) | FID | 98.7 |
| | Adversarial | BicycleGAN+Seg [180] | Augmented KAIST-MSBDD (Day → Night) | FID | 97.9 |
| | Adversarial | SCGAN [180] | Augmented KAIST-MSBDD (Day → Night) | FID | 92.4 |
| | Adversarial | TSIT+Seg [180] | Augmented KAIST-MSBDD (Day → Night) | FID | 80.8 |
| | Adversarial | INIT [180] | Augmented KAIST-MSBDD (Day → Night) | FID | 76.7 |
| | Adversarial | PanopticGAN [180] | Augmented KAIST-MSBDD (Day → Night) | FID | 69.4 |
| | Adversarial | CycleGAN [182] | BDD (Day → Night) | FID | 35.52 |
| | Adversarial | SemGAN [182] | BDD (Day → Night) | FID | 35.26 |
| | Adversarial | AugGAN [182] | BDD (Day → Night) | FID | 57.72 |
| | Adversarial | UNIT [182] | BDD (Day → Night) | FID | 32.66 |
| | Adversarial | MUNIT [182] | BDD (Day → Night) | FID | 69.97 |
| | Adversarial | SGND [182] | BDD (Day → Night) | FID | 31.25 |
| Person Re-ID | Clustering | CDCL [149] | DukeMTMC-ReID → Market1501 | mAP | 81.5% |
| | Clustering | DMD [150] | DukeMTMC-ReID → Market1501 | mAP | 92.7% |
| | Discrepancy | D-MMD [162] | DukeMTMC → Market1501 | mAP | 48.8% |
In semantic segmentation, the adversarial framework FREDOM [183] employs transformer-based networks, requiring careful tuning of hyperparameters such as the learning rate and loss weights.
For day-to-night translation tasks, adversarial models based on CycleGAN [137], such as the model in [168], benefit significantly from HPO, particularly through tuning of the cycle-consistency loss weight.
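The cycle-consistency term being tuned can be written as an L1 penalty on reconstructing an image after a round trip through both generators. The sketch below uses toy scalar "generators" to make the weight `lam` concrete; the mappings and values are illustrative assumptions, not the models of [168].

```python
import numpy as np

def cycle_consistency_loss(x, g_ab, g_ba, lam=10.0):
    """L1 cycle loss lam * ||G_BA(G_AB(x)) - x||_1; lam is the
    cycle-consistency weight that HPO typically tunes."""
    return lam * np.mean(np.abs(g_ba(g_ab(x)) - x))

# Toy stand-ins for the two generators (day->night and night->day).
g_ab = lambda x: x * 0.5 - 1.0    # hypothetical "day-to-night" mapping
g_ba = lambda x: (x + 1.0) / 0.5  # its exact inverse

day = np.array([0.2, 0.8, 0.5])
print(cycle_consistency_loss(day, g_ab, g_ba))  # ~0: perfect reconstruction

g_ba_bad = lambda x: x  # a generator that ignores inversion
print(cycle_consistency_loss(day, g_ab, g_ba_bad))  # positive: penalized
```

A larger `lam` enforces tighter content preservation at the cost of translation freedom, which is why it is a prime target for hyperparameter search.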
A comparison of different categories of DA models is presented in Table 4. In classification tasks, discrepancy-based methods such as D-MMD [175] achieved a mean accuracy of 72.63% on HSD (Sunny to Cloudy/Rainy/Snowy).
For object detection, the I2IT adversarial framework for day-to-night transformation [168] achieved an mAP of 55.3% on BDD (Day to Night) without augmentation, rising to 57.2% with augmentation.
In segmentation tasks, the adversarial I2IT framework with DaytoNight [168] achieved an mIoU of 59.5% on BDD (Day to Night) without augmentation and 61.6% with augmentation.
For I2IT, the adversarial VAE-GAN [148] achieved a PSNR of 27.38 and an SSIM of 0.93 on the Apollo traffic scene dataset (Haze to Dehaze), outperforming UNIT and CycleGAN on the same dataset. In one study [168], CycleGAN achieved an FID of 35.28 on the BDD dataset for the day-to-night task, while DaytoNight, SemGAN, and AugGAN reported FID scores of 39.26, 39.91, and 67.07, respectively. In a separate study [182], CycleGAN attained an FID of 35.52, whereas SemGAN slightly improved to 35.26. Additionally, AugGAN and MUNIT produced FID scores of 57.72 and 69.97, respectively. Notably, the SGND model achieved the lowest FID score of 31.25, suggesting superior image quality for the generated scenes compared to the other models.
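Since these models are ranked by FID, it helps to recall what the metric computes: the Fréchet distance between Gaussian fits of real and generated feature statistics (lower is better). The sketch below implements the closed form for the simplified diagonal-covariance case; real FID uses full covariances of Inception-network features, so this is an illustrative assumption.

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with diagonal covariances:
    ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2)).
    A simplification of FID, which uses full feature covariances."""
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    mu2, var2 = np.asarray(mu2, float), np.asarray(var2, float)
    return np.sum((mu1 - mu2) ** 2) + np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))

# Identical feature statistics -> distance of 0 (perfect match).
print(fid_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))
# Shifted "generated" statistics -> positive distance (lower realism).
print(fid_diagonal([0.0, 0.0], [1.0, 1.0], [1.0, 0.5], [2.0, 1.0]))
```

This makes explicit why SGND's FID of 31.25 indicates generated night scenes whose feature statistics sit closest to the real target distribution among the compared models.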
In Person Re-ID, clustering-based methods showed significant improvements. CDCL [149] achieved an mAP of 81.5% when adapting from DukeMTMC-ReID to Market1501. DMD [150] further improved the mAP to 92.7% on the same dataset pair, indicating the effectiveness of clustering techniques in handling domain shifts for Person Re-ID tasks. Discrepancy-based methods like D-MMD [162] achieved a lower mAP of 48.8%, suggesting that clustering methods may be more suitable for this application.
The results demonstrate that the choice of DA method significantly impacts performance across applications. Clustering-based methods are particularly strong in Person Re-ID, with DMD [150] achieving a high mAP of 92.7%. Discrepancy-based methods are effective in classification and object detection, exemplified by ML-ANet [161] with 94.83% accuracy in classification. Adversarial-based methods perform well across multiple tasks, including classification, object detection, segmentation, and I2IT, with models like SADA [175] achieving 93.20% accuracy and SGND [182] reaching the best FID of 31.25 in I2IT. Effectively addressing the domain gap is therefore crucial for enhancing the generalization capacity of ML models across applications.
In this section, we examine the key features of deep learning models for traffic scene understanding, highlighting their strengths, limitations, and potential areas for enhancement. The discussion covers discriminative, generative, and DA models. Table 5 summarizes the shortcomings and potential future directions for improvement across these categories, providing an overview to guide further research.
This subsection discusses the discriminative models, emphasizing their role in traffic scene understanding by examining their strengths and limitations, paving the way for future advancements and potential research directions.
Future Work: Future research should focus on reducing reliance on large labeled datasets through semi-supervised and self-supervised learning, improving CNN generalization. Adding context-aware modules like attention mechanisms or non-local operations can help capture global dependencies in traffic scenes, boosting performance in complex environments without the computational cost of transformers.
Future Work: For future work, we largely expect efforts to focus on the more developed Vanilla R-CNN variants. One possible direction specifically for R-CNN is the development of more efficient algorithms for generating region proposals that can be integrated seamlessly into the R-CNN pipeline. Furthermore, improvements to the ROI pooling procedure could enhance performance, particularly for small-object detection, which would be especially beneficial for certain traffic scene processing tasks such as aerial traffic tracking. Enhancements should also focus on better handling occlusions by incorporating more sophisticated feature extraction techniques that can capture partially hidden objects effectively.
Future Work: Although Faster R-CNN has addressed several limitations of Fast R-CNN, the R-CNN family could still benefit from advancements in representation learning, for example by leveraging modern architectures such as ViTs.
Future Work: Future work could investigate the development of lighter, more resource-efficient Faster R-CNN variants that retain high detection performance while enabling deployment in real-time traffic applications. Additionally, enhancing small-object detection capabilities in Faster R-CNN represents another promising research direction.
Disadvantages:
Future Work: Future research should optimize the mask prediction branch to reduce computational overhead while maintaining high accuracy. Enhancing small-object segmentation with multi-scale fusion and advanced attention mechanisms, along with robust algorithms for occlusions using 3D spatial data or multi-view inputs, could further advance Mask R-CNN.
Future Work: Future research should enhance YOLO's detection of small and overlapping objects by refining its grid-based approach, advancing post-processing techniques, and developing more versatile backbones. Efforts should also focus on improving adaptability to challenging environments like adverse weather or crowded scenes and optimizing detection through anchor-free architectures and advanced attention mechanisms.
Future Work: Future research should prioritize reducing the computational cost of ViT models and their dependence on large datasets. As advancements are made in these areas, ViTs could be explored as viable alternatives for general computer vision tasks in traffic scenes, where CNNs currently prevail. Furthermore, enhancing ViTs' ability to generalize with limited annotated data through transfer learning and data-efficient training techniques could help mitigate overfitting and improve their applicability to traffic scene tasks with scarce data.
Future Work: To address DETR's slow convergence, future work could focus on developing more efficient training paradigms that reduce the number of epochs required for convergence. Another promising research direction involves improving the detection of smaller or occluded objects in traffic scenes by modifying the attention mechanism to better capture fine details. Reducing computational overhead to enable real-time deployment is also crucial, as is exploring more flexible approaches for controlling the maximum number of detected objects.
Future Work: Future research should focus on efficient graph construction for dynamic traffic scenes, including adaptive real-time updates and robust occlusion handling. Vision GNNs can model spatial relationships and reconstruct occluded features. Reducing GNN training complexity and exploring hybrid approaches with self-supervised learning or multi-modal fusion can further improve robustness and efficiency.
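The graph-construction step mentioned above can be illustrated with a small sketch: a k-nearest-neighbour graph over the centroids of detected road users, which a GNN could then process. The centroid coordinates, `k`, and the Euclidean metric are illustrative assumptions; real pipelines may build graphs from appearance features or learned edge weights instead.

```python
import numpy as np

def knn_scene_graph(centroids, k=2):
    """Build a symmetric k-nearest-neighbour adjacency matrix over objects."""
    n = len(centroids)
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # no self-loops
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        adj[i, np.argsort(d[i])[:k]] = True
    return adj | adj.T                   # symmetrise for an undirected graph

# Centroids of four detected road users (x, y) in image coordinates:
# two close together on the left, two close together on the right.
cents = np.array([[10., 10.], [12., 11.], [80., 80.], [82., 79.]])
A = knn_scene_graph(cents, k=1)
```

For dynamic scenes, the same construction would be re-run (or incrementally updated) per frame, which is exactly where the adaptive real-time updates discussed above become important.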
Future Work: A promising direction for future research is to optimize the routing-by-agreement mechanism to reduce computational complexity while preserving the model's ability to capture spatial hierarchies. Such improvements would enhance the practicality of CapsNets for real-world traffic scene understanding.
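To make the cost of routing-by-agreement concrete, the sketch below implements the standard dynamic routing loop in NumPy over precomputed prediction vectors `u_hat[in_caps, out_caps, dim]`. The shapes and iteration count are illustrative; note that the softmax-sum-squash cycle repeats for every routing iteration, which is the overhead the proposed optimizations would target.

```python
import numpy as np

def squash(v, axis=-1):
    """CapsNet non-linearity: scales vector norms into [0, 1), keeps direction."""
    n2 = (v ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + 1e-9)

def route(u_hat, iters=3):
    """Dynamic routing-by-agreement over predictions u_hat[in, out, dim]."""
    b = np.zeros(u_hat.shape[:2])                                # routing logits
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)     # coupling coeffs
        s = (c[..., None] * u_hat).sum(axis=0)                   # weighted votes
        v = squash(s)                                            # output capsules
        b = b + (u_hat * v[None]).sum(axis=-1)                   # agreement update
    return v, c

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(8, 3, 4))   # 8 input capsules, 3 output capsules, dim 4
v, c = route(u_hat)
```

Each input capsule's coupling coefficients form a distribution over output capsules, and agreement between a vote and the squashed output sharpens that distribution across iterations.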
This subsection explores generative models, highlighting their significance in traffic scene understanding by analyzing their advantages and challenges, and suggesting potential future improvements and research opportunities.
Future Work: In the future, researchers should explore mechanisms to address training instability, such as developing more robust optimization methods or hybrid models that integrate GANs with more stable generative frameworks. Incorporating regularization techniques into the objective function may also help mitigate mode collapse, making GANs more suitable for generating realistic synthetic traffic scene data.
Future Work: Future work could prioritize mitigating the overfitting risk in cGANs by developing more effective regularization techniques and enhancing the diversity of conditional data. Additionally, researchers could explore methods to reduce computational overhead by designing lightweight architectures better suited for real-time traffic scene generation.
Future Work: In the future, researchers are expected to develop improved adversarial training paradigms to enhance the reconstruction quality of VAE models. Additionally, efforts should focus on designing optimization mechanisms that automatically address the trade-off between accurate reconstruction and diverse data generation.
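The reconstruction-versus-generation trade-off discussed above is often exposed through a weighted ELBO, as in beta-VAE: a single coefficient on the KL term trades reconstruction fidelity against latent regularity and sample diversity. The sketch below is a minimal NumPy version under illustrative assumptions (Gaussian decoder, closed-form Gaussian KL); it is not the specific objective of any model surveyed here.

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Weighted ELBO: beta > 1 pushes toward a smoother, more diverse latent
    space; beta < 1 favours sharper reconstructions."""
    recon = ((x - x_hat) ** 2).sum()                            # Gaussian recon term
    kl = 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar).sum()  # closed-form KL
    return recon + beta * kl, recon, kl

# Toy encoder outputs for one sample.
x = np.array([0.2, 0.8]); x_hat = np.array([0.25, 0.7])
mu = np.array([0.5, -0.3]); logvar = np.array([-0.1, 0.2])
loss_lo, recon, kl = beta_vae_loss(x, x_hat, mu, logvar, beta=0.5)
loss_hi, _, _ = beta_vae_loss(x, x_hat, mu, logvar, beta=4.0)
```

An "automatic" trade-off mechanism of the kind proposed above would effectively schedule or learn `beta` rather than fixing it by hand.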
This subsection examines DA models, focusing on their contributions to traffic scene understanding by evaluating their benefits and limitations and outlining avenues for future research and development.
Future Work: In the future, we expect research to focus on improving the robustness of cluster formation in noisy or overlapping domains, enhancing scalability for large and complex datasets, and exploring novel approaches to integrating clustering with other DA methods to improve performance across diverse tasks.
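The pseudo-label and prototype reliance mentioned above can be sketched in a few lines: targets are assigned to the nearest source-class prototype, and low-confidence assignments are withheld. The feature vectors, cosine metric, and confidence threshold `tau` are illustrative assumptions, not a specific published method.

```python
import numpy as np

def pseudo_label(target_feats, prototypes, tau=0.8):
    """Assign each target feature to its nearest class prototype; keep only
    confident assignments (cosine similarity above tau) as pseudo-labels."""
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = t @ p.T
    labels = sims.argmax(axis=1)
    labels[sims.max(axis=1) < tau] = -1   # -1 marks samples left unlabelled
    return labels

protos = np.array([[1.0, 0.0], [0.0, 1.0]])            # source-class prototypes
feats = np.array([[0.9, 0.1], [0.1, 0.95], [0.7, 0.7]])  # target features
labs = pseudo_label(feats, protos)
```

The third sample sits between the two prototypes (a domain-overlap case) and is deliberately left unlabelled; noisy-domain robustness amounts to making this rejection step less brittle.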
TABLE 5. Summary of shortcomings and future directions for improvement in Discriminative, Generative, and DA models.
| Category | Framework | Limitations | Future Works |
| --- | --- | --- | --- |
| Discriminative | CNN | - Data dependency - Generalization issues - Limited global contextual understanding | - Explore semi-supervised and self-supervised learning methods - Integrate attention mechanisms for global dependencies |
| | Vanilla R-CNN | - Inefficient region proposal strategy - High memory consumption - Lack of end-to-end training - Challenges with occlusions | - Develop efficient region proposal algorithms - Improve the ROI pooling procedure - Develop techniques for improved handling of occluded objects |
| | Fast R-CNN | - Inefficient region proposal strategy - Lower small-object detection accuracy | - Leverage Faster R-CNN to improve region proposal efficiency - Enhance representational learning with advanced networks like ViT |
| | Faster R-CNN | - Limited real-time performance - Resource-intensive - Lower small-object detection accuracy | - Develop lighter, more resource-efficient variants - Improve small-object detection |
| | Mask R-CNN | - High computational demand - Small-object segmentation issues - Training complexity | - Optimize mask prediction for efficiency - Improve small-object segmentation with multi-scale fusion - Handle occlusions using 3D or multi-view data |
| | YOLO | - Difficulty with small objects - Challenges with overlapping objects - Accuracy trade-offs - Sensitivity to occlusions | - Refine grid-based approach and develop post-processing techniques - Improve adaptability to challenging environments - Enhance occlusion handling with robust feature extraction |
| | ViT | - Heavy data and computation - Overfitting risk - Training complexity | - Reduce computational costs - Improve transfer learning and data-efficient training methods |
| | DETR | - Slow training convergence - Difficulty with small and occluded objects - Computational overhead | - Develop more efficient training paradigms - Improve small-object detection - Reduce computational overhead for real-time deployment |
| | GNN | - High computational demand - Data preprocessing complexity - Training complexity | - Develop graph construction process for images - Reduce training complexity - Leverage ViGs for handling occlusions and reconstructing object features - Explore hybrid approaches with self-supervised learning and multi-modal fusion |
| | CapsNet | - Computational complexity - Scalability issues | - Optimize routing-by-agreement mechanism - Enhance scalability for real-world applications |
| Generative | GAN | - Training instability - Mode collapse risk | - Develop robust optimization methods - Explore hybrid models and regularization techniques |
| | cGAN | - Training complexity - Risk of conditional overfitting - Higher resource requirements | - Improve regularization techniques - Reduce computational overhead with lightweight architectures |
| | VAE | - Blurry reconstructions - Reconstruction vs. generation trade-off | - Enhance reconstruction quality through adversarial training - Balance accurate reconstruction and diverse generation |
| DA | Clustering | - Sensitivity to cluster quality - Difficulty in handling domain overlap - Scalability challenges - Reliance on pseudo-labels and prototypes | - Improve robustness of cluster formation in noisy domains - Enhance scalability for large complex datasets - Integrate clustering with other DA methods |
| | Discrepancy | - Dependence on metric choice - Limited adaptation to complex shifts - Sensitivity to feature representation | - Develop flexible discrepancy metrics for complex domain shifts - Improve feature representation techniques - Combine discrepancy-based methods with other adaptation strategies |
| | Adversarial | - Training instability - Mode collapse risk - Sensitivity to hyperparameters | - Improve stability of adversarial training - Address mode collapse issues - Develop robust hyperparameter tuning approaches |
Future Work: In the future, researchers should explore developing more flexible discrepancy metrics that can handle complex domain shifts, improving feature representation techniques, and exploring hybrid approaches that combine discrepancy-based methods with other adaptation strategies to enhance robustness and generalization across diverse traffic scene understanding tasks.
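A concrete example of such a discrepancy metric is Maximum Mean Discrepancy (MMD), widely used in discrepancy-based DA to measure the gap between source and target feature distributions. The NumPy sketch below uses the biased RBF-kernel estimator; the kernel bandwidth and toy feature distributions are illustrative assumptions.

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between samples X and Y, RBF kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(100, 2))       # source-domain features
tgt_near = rng.normal(0.1, 1.0, size=(100, 2))  # mild domain shift
tgt_far = rng.normal(3.0, 1.0, size=(100, 2))   # severe domain shift
```

In training, this scalar would be added to the task loss so that the feature extractor is pushed to minimize it; the fixed bandwidth `gamma` is precisely the inflexibility that multi-kernel or learned metrics aim to remove.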
Future Work: Future research should aim to improve the stability of adversarial training, address mode collapse issues, and explore more robust approaches to hyperparameter tuning to enhance the scalability and reliability of adversarial-based DA methods across a wider range of tasks.
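The core component of DANN-style adversarial DA, the gradient reversal layer, illustrates both the mechanism and its hyperparameter sensitivity: the layer is the identity in the forward pass and multiplies gradients by a negative coefficient `lambda` in the backward pass, and tuning that coefficient is one of the stability issues noted above. The sketch below shows the two passes manually in NumPy rather than through an autograd framework.

```python
import numpy as np

class GradReverse:
    """Gradient reversal layer: identity forward, gradients scaled by -lambda
    backward, so the feature extractor learns to fool the domain classifier."""
    def __init__(self, lam=1.0):
        self.lam = lam
    def forward(self, x):
        return x                         # features pass through unchanged
    def backward(self, grad_out):
        return -self.lam * grad_out      # reversed gradient reaches the extractor

grl = GradReverse(lam=0.5)
x = np.array([1.0, -2.0, 3.0])           # features entering the domain head
g = np.array([0.1, 0.2, -0.3])           # gradient from the domain classifier
y = grl.forward(x)
gx = grl.backward(g)
```

Schedules that ramp `lam` from 0 upward during training are one common heuristic for taming the instability discussed above.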
While advancements in DL methods have significantly improved traffic scene understanding, further progress is possible. This section highlights key research topics for future work, emphasizing the need for more reliable, versatile, efficient, and scalable DL frameworks. The performance of these systems, particularly in complex real-world scenarios, hinges on the quality of underlying DL models. Future work should focus on developing models that address practical challenges like real-time performance and generalizability, along with holistic challenges such as integrating multi-modal data and enhancing model interpretability.
Exploring methodologies from diverse domains could boost the robustness and versatility of traffic scene understanding models. For example, disaster management techniques [189] may inspire innovative approaches to traffic analysis. Adapting successful strategies from other fields could improve scalability and reliability. Additionally, integrating real-time algorithms, like the simultaneous vehicle detection and tracking method for aerial videos [190], could enhance the speed and scalability of DL models for urban traffic scene understanding.
Most computer vision-based deep learning frameworks for traffic scene understanding operate as black-box models, offering no straightforward way to assess or interpret their outputs. This raises concerns about the reliability and transparency of such systems, and about the feasibility of deploying real-world applications that leverage traffic scene understanding for decision-making tasks such as road safety and risk assessment [191]. The inability to justify decisions based on the vision component's output further complicates deployment. Especially for downstream applications like autonomous driving, it is preferable to have some mechanism for evaluating how the deep model actually understands a traffic scene, allowing a domain expert to identify gaps in the model's capabilities. XAI addresses this issue by providing techniques and methodologies, such as post-hoc explanation methods or inherently interpretable architectures, that clarify how inputs influence the model's output. Indeed, for real-world systems such as automated urban intervention systems, which aim to improve pedestrian and vehicle safety by leveraging DNNs for detection, tracking, and behavior prediction, researchers have recently proposed adopting XAI techniques to provide insights into traffic control, surveillance, and collision prevention for autonomous vehicles [192]. Recent research [191] has introduced interpretability of neural networks in traffic sign recognition systems to enhance road safety and optimize traffic management, leveraging XAI techniques such as Local Interpretable Model-Agnostic Explanations (LIME) and Gradient-weighted Class Activation Mapping (Grad-CAM). LIME explains a prediction by fitting an interpretable surrogate that approximates the model's behavior locally around that input, while Grad-CAM generates heat maps showing which regions of an image contribute most to a prediction, based on the gradients flowing into the deep convolutional layers.
Moreover, the significance of scene understanding for autonomous vehicles in unstructured traffic environments is emphasized in [193], suggesting the use of models like the Inception U-Net with Grad-CAM visualization to enhance navigation in crowded traffic scenarios.
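The core Grad-CAM computation is compact enough to sketch directly. Assuming the activations of a convolutional layer and the gradients of the target class score with respect to them have already been extracted (normally done with framework hooks), the heat map is a gradient-weighted, ReLU-clipped sum of channels. The toy tensors below are illustrative stand-ins for a real network's activations.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heat map from a conv layer's activations A[c, h, w] and the
    gradients dScore/dA of the target class w.r.t. those activations."""
    weights = gradients.mean(axis=(1, 2))                # global-average-pool grads
    cam = np.maximum((weights[:, None, None] * activations).sum(0), 0)  # ReLU
    return cam / (cam.max() + 1e-9)                      # normalise to [0, 1]

rng = np.random.default_rng(0)
A = rng.random((4, 7, 7))            # activations from the last conv block
G = np.zeros((4, 7, 7)); G[0] = 1    # toy case: score depends only on channel 0
heat = grad_cam(A, G)
```

In a traffic sign recognizer, `heat` upsampled to the input resolution would highlight which image regions (ideally the sign itself) drove the prediction, giving a domain expert a quick sanity check on the model's reasoning.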
While XAI has been applied to specific traffic computer vision tasks, significant limitations remain in both performance and integration. Methods like LIME and Grad-CAM need improvement, particularly as research shifts from CNN-based learning to ViTs and GNNs. Most XAI work also focuses on single-model outputs, overlooking complex systems such as multi-target multi-camera tracking. Further exploration is required to integrate XAI into multi-modal systems, as demonstrated by a recent autonomous driving XAI system that uses multi-modal image captioning to justify decision-making [194]. This opens new possibilities for XAI systems that merge text and image data to interpret the decision-making of traffic systems. Additionally, integrating XAI for real-time explainability could deepen insights in applications like traffic anomaly detection and object detection, improving robustness in challenging conditions such as adverse-weather segmentation [195].
As discussed in this work, transformers and GNNs have gained increasing attention in recent years, with studies showing that ViT and deep GNNs can rival leading CNN architectures while often reducing computational demands [88], [195]. While CNNs have dominated feature extraction in computer vision for over a decade, emerging architectures offer promising opportunities for further advancement. ViT models, for example, require higher-quality data than CNNs due to their lack of inductive bias [196], [197], which historically made CNNs more robust to challenges like occlusion by enforcing local spatial coherence. Influenced by non-local neural networks, ViTs leverage global attention to better handle complex occlusions through long-range dependency modeling. Similarly, GNNs, traditionally limited by over-smoothing in shallow architectures [198], have seen breakthroughs enabling deeper models [199], [200]. Competitive vision GNNs (ViGs), such as the recent model by [195], now match CNN and ViT performance in tasks like classification and detection. GNNs excel at representing graph-structured data, making them effective for reconstructing occluded objects and reasoning about partially visible entities in traffic scenes. Self-supervised learning (SSL) also holds strong potential for these architectures, as methods like contrastive learning [155] enhance performance and robustness, helping mitigate occlusion by fostering holistic representations from incomplete or obstructed data.
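The contrastive SSL objective mentioned above (e.g., InfoNCE) can be sketched in a few lines: embeddings of two augmented views of the same scene should match each other against all other samples in the batch as negatives. The embedding dimensions, temperature, and toy "views" below are illustrative assumptions.

```python
import numpy as np

def info_nce(z1, z2, temp=0.1):
    """InfoNCE loss: row i of z1 should match row i of z2 (two augmented
    views of the same scene), with all other rows serving as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temp
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p.diagonal().mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = info_nce(z, z + 0.01 * rng.normal(size=(8, 16)))  # matched views
mismatched = info_nce(z, rng.normal(size=(8, 16)))          # unrelated views
```

The occlusion benefit follows from the augmentation choice: if one "view" is a heavily cropped or masked version of the scene, minimizing this loss forces the encoder to build holistic representations from incomplete evidence.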
Critically, while many of the models discussed demonstrate strong performance in offline traffic vision tasks such as object detection, inefficiencies make them difficult to deploy in real-time processing applications. For example, although YOLOv8, the most recent member of the YOLO family covered in this paper, achieves high performance in traffic object detection, its variants still struggle with small-object detection, multi-scale object detection, and detection under adverse environmental conditions [201]. Recent studies have shown that transformer-based architectures can achieve significantly lower latency; however, these models still face difficulties with small-object detection and other challenging conditions [159]. In other domains, researchers have explored combining generative models, such as GANs [124], with ViTs [88] to address complex scenarios [202], though further research is necessary to mitigate the high computational costs associated with GANs. For complex traffic scene applications, particularly those involving multiple cameras and downstream decision agents, the underlying deep learning model must be both lightweight and capable of delivering high performance.
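Whether a model meets a real-time budget is ultimately an empirical question, so a per-frame latency harness is a useful companion to accuracy benchmarks. The sketch below times an arbitrary inference callable; the stand-in "detector" (one matrix multiply) and the 30 FPS budget are illustrative assumptions, and any real model's `predict` function could be dropped in.

```python
import time
import numpy as np

def measure_latency(infer, frame, warmup=5, runs=50):
    """Mean per-frame latency (ms) and throughput (FPS) for an inference fn."""
    for _ in range(warmup):
        infer(frame)                     # warm caches / lazy initialisation
    t0 = time.perf_counter()
    for _ in range(runs):
        infer(frame)
    ms = (time.perf_counter() - t0) / runs * 1e3
    return ms, 1e3 / ms

# Stand-in "detector": one big matmul on a flattened frame.
W = np.random.default_rng(0).random((3072, 256))
frame = np.random.default_rng(1).random((1, 3072))
ms, fps = measure_latency(lambda f: f @ W, frame)
budget_ok = ms < 33.3                    # illustrative 30 FPS real-time budget
```

Warm-up iterations matter in practice because first calls often pay one-off costs (JIT compilation, GPU memory allocation) that would otherwise skew the average.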
Currently, many widely used datasets for traffic scene understanding tasks consist of synthetic data, such as SYNTHIA [203] and GTA5 [204], which feature automatically annotated images from traffic scenes created in the Unity Game Engine and the GTA 5 game environment. Generative AI and DA are commonly employed to address limitations in training data by generating augmented samples for rare or hard-to-capture scenarios, including occluded objects or accidents under adverse weather conditions [205]. Despite advancements, data from virtual simulations remains limited unless traffic objects behave realistically and scenes feature high-fidelity graphics comparable to real-world data. While some work has pursued generating photo-realistic traffic scenes for computer vision tasks [206], recent improvements in virtual engines enable much higher-quality synthetic data generation with automatic labeling at scale [207]. Enhanced rendering capabilities allow the simulation of diverse traffic scenarios, including challenging conditions like rain, snow, or occlusion-heavy nighttime settings. For tasks with limited data or imbalances, synthetic data can help improve model performance on occluded objects and other real-world challenges, reducing reliance on manual annotation. Finally, integrating simulated data with generative augmentation techniques [208] presents a promising approach to mitigate data scarcity while addressing occlusion-related challenges in traffic scene understanding.
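At the simplest end of the augmentation spectrum sits synthetic occlusion: pasting a random occluder into a training image, a cheap stand-in for the simulator- or GAN-generated hard examples discussed above. The rectangle shape, flat grey fill, and size fraction below are illustrative assumptions.

```python
import numpy as np

def random_occlusion(img, max_frac=0.3, rng=None):
    """Paste a random grey rectangle into the image to simulate occlusion."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = img.shape[:2]
    oh = int(rng.integers(1, int(h * max_frac) + 1))
    ow = int(rng.integers(1, int(w * max_frac) + 1))
    y = int(rng.integers(0, h - oh + 1))
    x = int(rng.integers(0, w - ow + 1))
    out = img.copy()
    out[y:y + oh, x:x + ow] = 0.5        # flat grey occluder
    return out

img = np.zeros((64, 64, 3))              # toy stand-in for a training frame
aug = random_occlusion(img, rng=np.random.default_rng(0))
```

Because the occluder's position and extent are known, the same routine can also emit occlusion masks as free labels, mirroring the automatic annotation that makes simulated data attractive.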
The robustness and comprehensiveness of object detection and segmentation in traffic scenes could be significantly enhanced by leveraging the fusion of data from multimodal sensory inputs, such as panoramic images, LiDAR (Light Detection and Ranging) point clouds, thermal imaging, infrared, and video footage. Additionally, incorporating the sophisticated reasoning capabilities of large language models (LLMs) and multimodal LLMs (MLLMs) [209], [210], [211] could facilitate the integration of real-time text-based and linguistic communication with image and video data [212]. Furthermore, although [213] has made progress in applying language-based knowledge guidance, most research focuses on data fusion in only two domains [214], [215], [216]. A comprehensive benchmark is essential for effectively comparing these works and advancing the development of more optimized and holistic multimodal approaches. Effective multi-sensor data fusion is critical. Designing, assessing, and optimizing the performance of fusion operations for deep generative models are key questions. Interoperability of different multi-modality methods [217] with existing infrastructure and their adaptability to evolving traffic conditions will be crucial for their successful implementation. Recent research, such as Feng et al. [217], emphasizes that multi-modal sensor fusion (e.g., LiDAR, cameras, radar) enhances robustness in object detection by addressing challenges like occlusion and adverse conditions. By effectively integrating complementary information from diverse sensor inputs, occluded objects can be more reliably detected and classified, thereby overcoming one of the significant limitations of single-modality perception approaches.
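One of the simplest fusion operations to be designed and assessed is late fusion: each modality is encoded separately and the per-modality feature vectors are concatenated before a shared projection head. The sketch below is a minimal NumPy version; the feature dimensions, random weights, and ReLU head are illustrative assumptions, not a surveyed architecture.

```python
import numpy as np

def late_fusion(cam_feat, lidar_feat, W, b):
    """Late fusion: concatenate per-modality feature vectors, then a linear
    head produces the fused representation used by the detector."""
    fused = np.concatenate([cam_feat, lidar_feat], axis=-1)
    return np.maximum(fused @ W + b, 0)       # ReLU projection

rng = np.random.default_rng(0)
cam = rng.normal(size=(128,))        # e.g. CNN image embedding
lidar = rng.normal(size=(64,))       # e.g. point-cloud encoder embedding
W = rng.normal(size=(192, 32)) * 0.1
b = np.zeros(32)
out = late_fusion(cam, lidar, W, b)
```

The appeal for occlusion handling is that the concatenated vector still carries LiDAR evidence when the camera channel is degraded; richer schemes replace the concatenation with learned attention over modalities.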
In conclusion, this review has provided an extensive exploration of deep learning models and their application to traffic scene understanding, a crucial component in advancing intelligent transportation systems. By categorizing and analyzing discriminative, generative, and domain adaptation models, we have offered a comprehensive perspective on the evolution of traffic scene analysis techniques, highlighting the significant advancements and ongoing challenges in the field. Our discussion on hyperparameter optimization has further emphasized the importance of fine-tuning these models for enhanced efficiency and real-time applicability.
This paper has addressed the gaps present in existing literature, such as the lack of focus on generative models, limited coverage of domain adaptation techniques, and insufficient analysis of hyperparameter optimization methods. By presenting a structured comparison of discriminative, generative, and DA models, we provided a nuanced understanding of each category's strengths and weaknesses, which can guide researchers in selecting appropriate models for their specific needs in traffic scene analysis. Furthermore, our review identified emerging areas such as XAI, multi-modal data integration, and real-time processing as pivotal research directions for future work.
Moving forward, it is evident that there is a growing need to enhance the robustness, interpretability, and efficiency of deep learning systems in traffic environments. We encourage future research efforts to focus on improving model performance under diverse environmental conditions, integrating multiple data sources for richer scene understanding, and advancing explainability to foster trust in AI-driven transportation systems. By addressing these challenges, we believe that deep learning will continue to play a pivotal role in shaping the future of intelligent, safe, and efficient transportation solutions.

REFERENCES
[1] A. Boukerche and Z. Hou, "Object detection using deep learning methods in traffic scenarios," ACM Comput. Surv., vol. 54, no. 2, pp. 1-35, Mar. 2021.
[2] Y. Huang and Y. Chen, "Autonomous driving with deep learning: A survey of state-of-art technologies," 2020, arXiv:2006.06091.
[3] Z. Guo, Y. Huang, X. Hu, H. Wei, and B. Zhao, "A survey on deep learning based approaches for scene understanding in autonomous driving," Electronics, vol. 10, no. 4, p. 471, Feb. 2021.
[4] S. Grigorescu, B. Trasnea, T. Cocias, and G. Macesanu, "A survey of deep learning techniques for autonomous driving," J. Field Robot., vol. 37, no. 3, pp. 362-386, Apr. 2020.
[5] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[6] R. C. Luo, H. Potlapalli, and D. W. Hislop, "Translation and scale invariant landmark recognition using receptive field neural networks," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., vol. 1, Jun. 1992, pp. 527-533, doi: 10.1109/IROS.1992.587385.
[7] P. Sermanet and Y. LeCun, "Traffic sign recognition with multi-scale convolutional networks," in Proc. Int. Joint Conf. Neural Netw., Jul. 2011, pp. 2809-2813, doi: 10.1109/IJCNN.2011.6033589.
[8] R. Fan, H. Wang, P. Cai, and M. Liu, "SNE-RoadSeg: Incorporating surface normal information into semantic segmentation for accurate freespace detection," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, Jan. 2020, pp. 340-356.
[9] J. He, C. Zhang, X. He, and R. Dong, "Visual recognition of traffic police gestures with convolutional pose machine and handcrafted features," Neurocomputing, vol. 390, pp. 248-259, May 2020.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580-587.
[11] G. Vinod and G. Padmapriya, "An adaptable real-time object detection for traffic surveillance using R-CNN over CNN with improved accuracy," in Proc. Int. Conf. Bus. Anal. Technol. Secur. (ICBATS), Feb. 2022, pp. 1-4, doi: 10.1109/ICBATS54253.2022.9759030.
[12] J. Hosang, M. Omran, R. Benenson, and B. Schiele, "Taking a deeper look at pedestrians," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 4073-4082, doi: 10.1109/CVPR.2015.7299034.
[13] V. Murugan, V. R. Vijaykumar, and A. Nidhila, "A deep learning RCNN approach for vehicle recognition in traffic surveillance system," in Proc. Int. Conf. Commun. Signal Process. (ICCSP), Apr. 2019, pp. 157-160.
[14] J. Zhang, Z. Xie, J. Sun, X. Zou, and J. Wang, "A cascaded R-CNN with multiscale attention and imbalanced samples for traffic sign detection," IEEE Access, vol. 8, pp. 29742-29754, 2020.
[15] J. Cao, J. Zhang, and X. Jin, "A traffic-sign detection algorithm based on improved sparse R-CNN," IEEE Access, vol. 9, pp. 122774-122788, 2021, doi: 10.1109/ACCESS.2021.3109606.
[16] C. Lin, Y. Shi, J. Zhang, C. Xie, W. Chen, and Y. Chen, "An anchor-free detector and R-CNN integrated neural network architecture for environmental perception of urban roads," Proc. Inst. Mech. Eng., D, J. Automobile Eng., vol. 235, no. 12, pp. 2964-2973, Oct. 2021.
[17] P. Li, Y. He, D. Yin, F. R. Yu, and P. Song, "Bagging R-CNN: Ensemble for object detection in complex traffic scenes," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Jun. 2023, pp. 1-5, doi: 10.1109/ICASSP49357.2023.10097085.
[18] T. Liang, H. Bao, W. Pan, and F. Pan, "Traffic sign detection via improved sparse R-CNN for autonomous vehicles," J. Adv. Transp., vol. 2022, pp. 1-16, Mar. 2022.
[19] M. Takahashi, K. Iino, H. Watanabe, I. Morinaga, S. Enomoto, X. Shi, A. Sakamoto, and T. Eda, "Category-based memory bank design for traffic surveillance in context R-CNN," Proc. SPIE, vol. 12592, Mar. 2023, Art. no. 125920G, doi: 10.1117/12.2666991.
[20] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka, L. Li, Z. Yuan, C. Wang, and P. Luo, "Sparse R-CNN: End-to-end object detection with learnable proposals," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 14449-14458.
[20] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka, L. Li, Z. Yuan, C. Wang, 和 P. Luo, “Sparse R-CNN:具有可学习提议的端到端目标检测,” 载于 IEEE/CVF 计算机视觉与模式识别会议(CVPR)论文集,2021年6月,第14449-14458页。
[21] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 9992-10002.
[22] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440-1448.
[23] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proc. Eur. Conf. Comput. Vis., vol. 37. Cham, Switzerland: Springer, Jan. 2014, pp. 1904-1916.
[25] R. Qian, Q. Liu, Y. Yue, F. Coenen, and B. Zhang, "Road surface traffic sign detection with hybrid region proposal and fast R-CNN," in Proc. 12th Int. Conf. Natural Comput., Fuzzy Syst. Knowl. Discovery (ICNC-FSKD), Aug. 2016, pp. 555-559, doi: 10.1109/FSKD.2016.7603233.
[26] Z. Zhang, K. Liu, F. Gao, X. Li, and G. Wang, "Vision-based vehicle detecting and counting for traffic flow analysis," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2016, pp. 2267-2273, doi: 10.1109/IJCNN.2016.7727480.
[27] Z. Moayed, A. Griffin, and R. Klette, "Traffic intersection monitoring using fusion of GMM-based deep learning classification and geometric warping," in Proc. Int. Conf. Image Vis. Comput. New Zealand (IVCNZ), Dec. 2017, pp. 1-5, doi: 10.1109/IVCNZ.2017.8402465.
[28] X. Li, L. Li, F. Flohr, J. Wang, H. Xiong, M. Bernhard, S. Pan, D. M. Gavrila, and K. Li, "A unified framework for concurrent pedestrian and cyclist detection," IEEE Trans. Intell. Transp. Syst., vol. 18, no. 2, pp. 269-281, Feb. 2017, doi: 10.1109/TITS.2016.2567418.
[29] K. S. Htet and M. M. Sein, "Event analysis for vehicle classification using fast RCNN," in Proc. IEEE 9th Global Conf. Consum. Electron. (GCCE), Oct. 2020, pp. 403-404, doi: 10.1109/GCCE50665.2020.9291978.
[30] A. Ali, O. G. Olaleye, B. Dey, and M. Bayoumi, "Fast deep pyramid DPM object detection with region proposal networks," in Proc. IEEE Int. Symp. Signal Process. Inf. Technol. (ISSPIT), Dec. 2017, pp. 168-173, doi: 10.1109/ISSPIT.2017.8388636.
[31] K. Wang and W. Zhou, "Pedestrian and cyclist detection based on deep neural network fast R-CNN," Int. J. Adv. Robot. Syst., vol. 16, no. 2, Mar. 2019, doi: 10.1177/1729881419829651.
[32] N. Arora, Y. Kumar, R. Karkra, and M. Kumar, "Automatic vehicle detection system in different environment conditions using fast R-CNN," Multimedia Tools Appl., vol. 81, no. 13, pp. 18715-18735, May 2022, doi: 10.1007/s11042-022-12347-8.
[33] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2015, pp. 91-99.
[34] C. Guindel, D. Martin, and J. M. Armingol, "Fast joint object detection and viewpoint estimation for traffic scene understanding," IEEE Intell. Transp. Syst. Mag., vol. 10, no. 4, pp. 74-86, Winter 2018.
[35] K. Qiao, H. Gu, J. Liu, and P. Liu, "Optimization of traffic sign detection and classification based on faster R-CNN," in Proc. Int. Conf. Comput. Technol., Electron. Commun. (ICCTEC), Dec. 2017, pp. 608-611, doi: 10.1109/ICCTEC.2017.00137.
[36] G. Wang and X. Ma, "Traffic police gesture recognition using RGB-D and faster R-CNN," in Proc. Int. Conf. Intell. Informat. Biomed. Sci. (ICIIBMS), vol. 3, Oct. 2018, pp. 78-81, doi: 10.1109/ICIIBMS.2018.8549975.
[37] T. Liu and T. Stathaki, "Faster R-CNN for robust pedestrian detection using semantic segmentation network," Frontiers Neurorobotics, vol. 12, p. 64, Oct. 2018, doi: 10.3389/fnbot.2018.00064.
[38] A. Mhalla, T. Chateau, S. Gazzah, and N. E. B. Amara, "An embedded computer-vision system for multi-object detection in traffic surveillance," IEEE Trans. Intell. Transp. Syst., vol. 20, no. 11, pp. 4006-4018, Nov. 2019, doi: 10.1109/TITS.2018.2876614.
[39] M. Zinanyuca and D. Arce, "Traffic parameters acquisition system using faster R-CNN deep learning based algorithm," in Proc. IEEE ANDESCON, Oct. 2020, pp. 1-6, doi: 10.1109/ANDESCON50619.2020.9271996.
[40] X. Gao, L. Chen, K. Wang, X. Xiong, H. Wang, and Y. Li, "Improved traffic sign detection algorithm based on faster R-CNN," Appl. Sci., vol. 12, no. 18, p. 8948, Sep. 2022, doi: 10.3390/app12188948.
[41] Y. Cui and D. Lei, "Optimizing Internet of Things-based intelligent transportation system's information acquisition using deep learning," IEEE Access, vol. 11, pp. 11804-11810, 2023, doi: 10.1109/ACCESS.2023.3242116.
[42] C. Cao, B. Wang, W. Zhang, X. Zeng, X. Yan, Z. Feng, Y. Liu, and Z. Wu, "An improved faster R-CNN for small object detection," IEEE Access, vol. 7, pp. 106838-106846, 2019, doi: 10.1109/ACCESS.2019.2932731.
[43] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot MultiBox detector," in Computer Vision-ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., Cham, Switzerland: Springer, 2016, pp. 21-37.
[44] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779-788.
[45] T. Jin, D. Zhang, F. Ding, Z. Zhang, and M. Zhang, "A vehicle detection algorithm in complex traffic scenes," Proc. SPIE, vol. 11519, Jun. 2020, Art. no. 115190C, doi: 10.1117/12.2573189.
[46] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6517-6525.
[47] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[48] Ultralytics. (2020). YOLOv5. [Online]. Available: https://github.com/ultralytics/yolov5
[49] X. Li, Z. Xie, X. Deng, Y. Wu, and Y. Pi, "Traffic sign detection based on improved faster R-CNN for autonomous driving," J. Supercomput., vol. 78, no. 6, pp. 7982-8002, Apr. 2022, doi: 10.1007/s11227-021-04230-4.
[50] R. Hu, H. Li, D. Huang, X. Xu, and K. He, "Traffic sign detection based on driving sight distance in haze environment," IEEE Access, vol. 10, pp. 101124-101136, 2022, doi: 10.1109/ACCESS.2022.3208108.
[51] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980-2988.
[52] S. Sarp, M. Kuzlu, M. Cetin, C. Sazara, and O. Guler, "Detecting floodwater on roadways from image data using mask-R-CNN," in Proc. Int. Conf. Innov. Intell. Syst. Appl. (INISTA), Aug. 2020, pp. 1-6, doi: 10.1109/INISTA49547.2020.9194655.
[53] E. H.-C. Lu, M. Gozdzikiewicz, K.-H. Chang, and J.-M. Ciou, "A hierarchical approach for traffic sign recognition based on shape detection and image classification," Sensors, vol. 22, no. 13, p. 4768, Jun. 2022, doi: 10.3390/s22134768.
[54] D. He, Y. Qiu, J. Miao, Z. Zou, K. Li, C. Ren, and G. Shen, "Improved mask R-CNN for obstacle detection of rail transit," Measurement, vol. 190, Feb. 2022, Art. no. 110728.
[55] L. Lou, Q. Zhang, C. Liu, M. Sheng, J. Liu, and H. Song, "Detecting and counting the moving vehicles using mask R-CNN," in Proc. IEEE 8th Data Driven Control Learn. Syst. Conf. (DDCLS), May 2019, pp. 987-992.
[56] E. J. Piedad, T.-T. Le, K. Aying, F. K. Pama, and I. Tabale, "Vehicle count system based on time interval image capture method and deep learning mask R-CNN," in Proc. IEEE Region 10 Conf. (TENCON), Oct. 2019, pp. 2675-2679.
[57] H. Tahir, M. Shahbaz Khan, and M. Owais Tariq, "Performance analysis and comparison of faster R-CNN, mask R-CNN and ResNet50 for the detection and counting of vehicles," in Proc. Int. Conf. Comput., Commun., Intell. Syst. (ICCCIS), Feb. 2021, pp. 587-594, doi: 10.1109/icccis51004.2021.9397079.
[58] C. Sazara, M. Cetin, and K. Iftekharuddin, "Image dataset for roadway flooding," Mendeley Data, Amsterdam, The Netherlands, Tech. Rep. V1, 2019. Accessed: Aug. 15, 2024.
[59] C. Sazara, M. Cetin, and K. M. Iftekharuddin, "Detecting floodwater on roadways from image data with handcrafted features and deep transfer learning," in Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), Oct. 2019, pp. 804-809, doi: 10.1109/ITSC.2019.8917368.
[60] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1800-1807.
[61] G. Jocher. (May 2020). YOLOv5 by Ultralytics. [Online]. Available: https://github.com/ultralytics/
[62] J.-P. Lin and M.-T. Sun, "A YOLO-based traffic counting system," in Proc. Conf. Technol. Appl. Artif. Intell. (TAAI), Nov. 2018, pp. 82-85, doi: 10.1109/TAAI.2018.00027.
[63] M. B. Jensen, K. Nasrollahi, and T. B. Moeslund, "Evaluating state-of-the-art object detector on challenging traffic light data," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 882-888, doi: 10.1109/CVPRW.2017.122.
[64] S. P. Rajendran, L. Shine, R. Pradeep, and S. Vijayaraghavan, "Real-time traffic sign recognition using YOLOv3 based detector," in Proc. 10th Int. Conf. Comput., Commun. Netw. Technol. (ICCCNT), Jul. 2019, pp. 1-7, doi: 10.1109/ICCCNT45670.2019.8944890.
[65] J. Yu, X. Ye, and Q. Tu, "Traffic sign detection and recognition in multiimages using a fusion model with YOLO and VGG network," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 9, pp. 16632-16642, Sep. 2022, doi: 10.1109/TITS.2022.3170354.
[66] Z. Yang, J. Li, and H. Li, "Real-time pedestrian and vehicle detection for autonomous driving," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2018, pp. 179-184, doi: 10.1109/IVS.2018.8500642.
[67] A. Corovic, V. Ilic, S. Duric, M. Marijan, and B. Pavkovic, "The real-time detection of traffic participants using YOLO algorithm," in Proc. 26th Telecommun. Forum (TELFOR), Nov. 2018, pp. 1-4, doi: 10.1109/TELFOR.2018.8611986.
[68] W. Song and S. A. Suandi, "TSR-YOLO: A Chinese traffic sign recognition algorithm for intelligent vehicles in complex scenes," Sensors, vol. 23, no. 2, p. 749, Jan. 2023, doi: 10.3390/s23020749.
[69] L. Xiaomeng, F. Jun, and C. Peng, "Vehicle detection in traffic monitoring scenes based on improved YOLOV5s," in Proc. Int. Conf. Comput. Eng. Artif. Intell. (ICCEAI), Jul. 2022, pp. 467-471, doi: 10.1109/ICCEAI55464.2022.00103.
[70] S. Zhang, S. Che, Z. Liu, and X. Zhang, "A real-time and lightweight traffic sign detection method based on ghost-YOLO," Multimedia Tools Appl., vol. 82, no. 17, pp. 26063-26087, Jul. 2023, doi: 10.1007/s11042-023-14342-z.
[71] C. Sinthia and Md. H. Kabir, "Detection and recognition of Bangladeshi vehicles' nameplates using YOLOV6 and BLPNET," in Proc. Int. Conf. Electr., Comput. Commun. Eng. (ECCE), Feb. 2023, pp. 1-6.
[72] T. Suwattanapunkul and L.-J. Wang, "The efficient traffic sign detection and recognition for Taiwan road using YOLO model with hybrid dataset," in Proc. 9th Int. Conf. Appl. Syst. Innov. (ICASI), Apr. 2023, pp. 160-162, doi: 10.1109/ICASI57738.2023.10179493.
[73] D. Shokri, C. Larouche, and S. Homayouni, "A comparative analysis of multi-label deep learning classifiers for real-time vehicle detection to support intelligent transportation systems," Smart Cities, vol. 6, no. 5, pp. 2982-3004, Oct. 2023.
[74] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248-255.
[75] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020, arXiv:2004.10934.
[76] C. Dewi, R.-C. Chen, X. Jiang, and H. Yu, "Deep convolutional neural network for enhancing traffic sign recognition developed on YOLO V4," Multimedia Tools Appl., vol. 81, no. 26, pp. 37821-37845, Nov. 2022, doi: 10.1007/s11042-022-12962-5.
[77] A. Gomaa and A. Abdalrazik, "Novel deep learning domain adaptation approach for object detection using semi-self building dataset and modified YOLOv4," World Electr. Vehicle J., vol. 15, no. 6, p. 255, Jun. 2024.
[78] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun, "RepVGG: Making VGG-style ConvNets great again," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 13728-13737.
[79] C. Li, L. Li, H. Jiang, K. Weng, Y. Geng, L. Li, Z. Ke, Q. Li, M. Cheng, W. Nie, Y. Li, B. Zhang, Y. Liang, L. Zhou, X. Xu, X. Chu, X. Wei, and X. Wei, "YOLOv6: A single-stage object detection framework for industrial applications," 2022, arXiv:2209.02976.
[80] C.-Y. Wang, A. Bochkovskiy, and H.-Y.-M. Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Los Alamitos, CA, USA, Jun. 2023, pp. 7464-7475.
[81] H. Zhang, Y. Ruan, A. Huo, and X. Jiang, "Traffic sign detection based on improved YOLOv7," in Proc. 5th Int. Conf. Intell. Control, Meas. Signal Process. (ICMSP), May 2023, pp. 71-75, doi: 10.1109/ICMSP58539.2023.10170868.
[82] L. Kantorovitch, "On the translocation of masses," Manage. Sci., vol. 5, no. 1, pp. 1-4, Oct. 1958.
[83] G. Jocher, A. Chaurasia, and J. Qiu. (2023). Ultralytics YOLOv8. [Online]. Available: https://github.com/ultralytics/ultralytics
[84] A. Ammar, A. Koubaa, M. Ahmed, A. Saad, and B. Benjdira, "Vehicle detection from aerial images using deep learning: A comparative study," Electronics, vol. 10, no. 7, p. 820, Mar. 2021.
[85] H. Zunair, S. Khan, and A. Ben Hamza, "RSUD20K: A dataset for road scene understanding in autonomous driving," 2024, arXiv:2401.07322.
[86] Z. Wang, S. Yang, H. Qin, Y. Liu, and J. Ding, "CCW-YOLO: A modified YOLOv5s network for pedestrian detection in complex traffic scenes," Information, vol. 15, no. 12, p. 762, Dec. 2024.
[87] Z. Chen, K. Yang, Y. Wu, H. Yang, and X. Tang, "HCLT-YOLO: A hybrid CNN and lightweight transformer architecture for object detection in complex traffic scenes," IEEE Trans. Veh. Technol., early access, Nov. 12, 2024, doi: 10.1109/TVT.2024.3496513.
[88] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[89] A. Abdelraouf, M. Abdel-Aty, and Y. Wu, "Using vision transformers for spatial-context-aware rain and road surface condition detection on freeways," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 10, pp. 18546-18556, Oct. 2022.
[90] S. Zhao, H. Li, Q. Ke, L. Liu, and R. Zhang, "Action-ViT: Pedestrian intent prediction in traffic scenes," IEEE Signal Process. Lett., vol. 29, pp. 324-328, 2022.
[91] M. Kang, W. Lee, K. Hwang, and Y. Yoon, "Vision transformer for detecting critical situations and extracting functional scenario for automated vehicle safety assessment," Sustainability, vol. 14, no. 15, p. 9680, Aug. 2022.
[92] J. Wurst, L. Balasubramanian, M. Botsch, and W. Utschick, "Novelty detection and analysis of traffic scenario infrastructures in the latent space of a vision transformer-based triplet autoencoder," in Proc. IEEE Intell. Vehicles Symp. (IV), Jul. 2021, pp. 1304-1311.
[93] J. Wurst, A. F. Fernández, M. Botsch, and W. Utschick, "An entropy based outlier score and its application to novelty detection for road infrastructure images," in Proc. IEEE Intell. Vehicles Symp. (IV), Oct. 2020, pp. 1436-1443.
[94] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in Proc. Eur. Conf. Comput. Vis., Jan. 2020, pp. 213-229.
[95] J. Xia, M. Li, W. Liu, and X. Chen, "DSRA-DETR: An improved DETR for multiscale traffic sign detection," Sustainability, vol. 15, no. 14, p. 10862, Jul. 2023.
[96] H. Wei, Q. Zhang, Y. Qian, Z. Xu, and J. Han, "MTSDet: Multi-scale traffic sign detection with attention and path aggregation," Appl. Intell., vol. 53, no. 1, pp. 238-250, Jan. 2023.
[97] P. Gao, M. Zheng, X. Wang, J. Dai, and H. Li, "Fast convergence of DETR with spatially modulated co-attention," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 3601-3610.
[98] T. Liang, H. Bao, W. Pan, X. Fan, and H. Li, "DetectFormer: Category-assisted transformer for traffic scene object detection," Sensors, vol. 22, no. 13, p. 4833, Jun. 2022.
[99] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," 2016, arXiv:1609.02907.
[100] S. Mylavarapu, M. Sandhu, P. Vijayan, K. M. Krishna, B. Ravindran, and A. Namboodiri, "Towards accurate vehicle behaviour classification with multi-relational graph convolutional networks," in Proc. IEEE Intell. Vehicles Symp. (IV), Oct. 2020, pp. 321-327.
[101] K. Liu, Y. Zheng, J. Yang, H. Bao, and H. Zeng, "Chinese traffic police gesture recognition based on graph convolutional network in natural scene," Appl. Sci., vol. 11, no. 24, p. 11951, Dec. 2021.
[102] Z. Fang, W. Zhang, Z. Guo, R. Zhi, B. Wang, and F. Flohr, "Traffic police gesture recognition by pose graph convolutional networks," in Proc. IEEE Intell. Vehicles Symp. (IV), Oct. 2020, pp. 1833-1838.
[103] J. Lian, Z. Wang, L. Li, and Y. Zhou, "The understanding of traffic police intention based on visual awareness," Neural Process. Lett., vol. 54, no. 4, pp. 2843-2859, Aug. 2022.
[104] F. Xu, F. Xu, J. Xie, C.-M. Pun, H. Lu, and H. Gao, "Action recognition framework in traffic scene for autonomous driving system," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 11, pp. 22301-22311, Nov. 2022.
[105] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 1, pp. 172-186, Jan. 2021.
[106] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lió, and Y. Bengio, "Graph attention networks," 2017, arXiv:1710.10903.
[107] P. N. Chowdhury, P. Shivakumara, S. Kanchan, R. Raghavendra, U. Pal, T. Lu, and D. Lopresti, "Graph attention network for detecting license plates in crowded street scenes," Pattern Recognit. Lett., vol. 140, pp. 18-25, Dec. 2020, doi: 10.1016/j.patrec.2020.09.018.
[108] Z. Wang, Z. Li, J. Leng, M. Li, and L. Bai, "Multiple pedestrian tracking with graph attention map on urban road scene," IEEE Trans. Intell. Transp. Syst., vol. 24, no. 8, pp. 8567-8579, Aug. 2023.
[109] T. Monninger, J. Schmidt, J. Rupprecht, D. Raba, J. Jordan, D. Frank, S. Staab, and K. Dietmayer, "SCENE: Reasoning about traffic scenes using heterogeneous graph neural networks," IEEE Robot. Autom. Lett., vol. 8, no. 3, pp. 1531-1538, Mar. 2023.
[110] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?" 2018, arXiv:1810.00826.
[111] A. V. Malawade, S.-Y. Yu, B. Hsu, H. Kaeley, A. Karra, and M. A. A. Faruque, "roadscene2vec: A tool for extracting and embedding road scene-graphs," Knowl.-Based Syst., vol. 242, Apr. 2022, Art. no. 108245.
[112] G. A. Noghre, V. Katariya, A. D. Pazho, C. Neff, and H. Tabkhi, "Pishgu: Universal path prediction network architecture for real-time cyber-physical edge systems," 2022, arXiv:2210.08057.
[113] Y. Tian, A. Carballo, R. Li, and K. Takeda, "RSG-search: Semantic traffic scene retrieval using graph-based scene representation," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2023, pp. 1-8.
[114] J. Wurst, L. Balasubramanian, M. Botsch, and W. Utschick, "Expert-LaSTS: Expert-knowledge guided latent space for traffic scenarios," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2022, pp. 484-491.
[115] M. Mendieta and H. Tabkhi, "CARPe posterum: A convolutional approach for real-time pedestrian path prediction," in Proc. AAAI Conf. Artif. Intell., May 2021, vol. 35, no. 3, pp. 2346-2354.
[116] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, Jan. 2017, pp. 3859-3869.
[117] A. Dinesh Kumar, "Novel deep learning model for traffic sign detection using capsule networks," 2018, arXiv:1805.04424.
[118] X. Liu, W. Q. Yan, and N. Kasabov, "Vehicle-related scene segmentation using CapsNets," in Proc. 35th Int. Conf. Image Vis. Comput. New Zealand (IVCNZ), Nov. 2020, pp. 1-6, doi: 10.1109/IVCNZ51579.2020.9290664.
[119] Z. Hao, "The method of recognizing traffic signs based on the improved capsule network," in Proc. Int. Conf. Comput. Eng. Intell. Control (ICCEIC), Nov. 2020, pp. 22-26.
[120] X. Liu and W. Q. Yan, "Traffic-light sign recognition using capsule network," Multimedia Tools Appl., vol. 80, no. 10, pp. 15161-15171, Apr. 2021, doi: 10.1007/s11042-020-10455-x.
[121] W. Yang and W. Zhang, "Real-time traffic signs detection based on YOLO network model," in Proc. Int. Conf. Cyber-Enabled Distrib. Comput. Knowl. Discovery (CyberC), Oct. 2020, pp. 354-357, doi: 10.1109/CyberC49757.2020.00066.
[122] Y. Liu, G. Shi, Y. Li, and Z. Zhao, "M-YOLO: Traffic sign detection algorithm applicable to complex scenarios," Symmetry, vol. 14, no. 5, p. 952, May 2022, doi: 10.3390/sym14050952.
[123] C. Dewi, R.-C. Chen, Y.-T. Liu, X. Jiang, and K. D. Hartomo, "YOLO V4 for advanced traffic sign recognition with synthetic training data generated by various GAN," IEEE Access, vol. 9, pp. 97228-97242, 2021, doi: 10.1109/ACCESS.2021.3094201.
[124] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, vol. 27, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, Eds., Red Hook, NY, USA: Curran Associates, 2014.
[125] K. Zhang, X. Feng, N. Jia, L. Zhao, and Z. He, "TSR-GAN: Generative adversarial networks for traffic state reconstruction with time space diagrams," Phys. A, Stat. Mech. Appl., vol. 591, Apr. 2022, Art. no. 126788.
[126] P. König, S. Aigner, and M. Körner, "Enhancing traffic scene predictions with generative adversarial networks," in Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), Oct. 2019, pp. 1768-1775.
[127] Y. Cai, L. Dai, H. Wang, and Z. Li, "Multi-target pan-class intrinsic relevance driven model for improving semantic segmentation in autonomous driving," IEEE Trans. Image Process., vol. 30, pp. 9069-9084, 2021.
[128] W. Xu, N. Souly, and P. P. Brahma, "Reliability of GAN generated data to train and validate perception systems for autonomous vehicles," in Proc. IEEE Winter Conf. Appl. Comput. Vis. Workshops (WACVW), Jan. 2021, pp. 171-180.
[129] M. Uricár, G. Sistu, H. Rashed, A. Vobecký, V. R. Kumar, P. Krížek, F. Bürger, and S. Yogamani, "Let's get dirty: GAN based data augmentation for camera lens soiling detection in autonomous driving," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2021, pp. 766-775.
[130] X. Cheng, J. Zhou, J. Song, and X. Zhao, "A highway traffic image enhancement algorithm based on improved GAN in complex weather conditions," IEEE Trans. Intell. Transp. Syst., vol. 24, no. 8, pp. 8716-8726, Aug. 2023.
[131] C. Jiqing, W. Depeng, L. Teng, L. Tian, and W. Huabin, "All-weather road drivable area segmentation method based on CycleGAN," Vis. Comput., vol. 39, no. 10, pp. 5135-5151, Oct. 2023.
[132] A. Mukherjee, A. Joshi, C. Hegde, and S. Sarkar, "Semantic domain adaptation for deep classifiers via gan-based data augmentation," in Proc. Conf. Neural Inf. Process. Syst. Workshops, 2019, pp. 1-7.
[133] A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, H. Rezatofighi, and S. Savarese, "SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1349-1358.
[134] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, "Photo-realistic single image super-resolution using a generative adversarial network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 105-114.
[135] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas, "DeblurGAN: Blind motion deblurring using conditional adversarial networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8183-8192.
[136] S. Aigner and M. Körner, "FutureGAN: Anticipating the future frames of video sequences using spatio-temporal 3D convolutions in progressively growing GANs," 2018, arXiv:1810.01325.
[137] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2242-2251.
[138] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, "AttGAN: Facial attribute editing by only changing what you want," IEEE Trans. Image Process., vol. 28, no. 11, pp. 5464-5478, Nov. 2019.
[139] F. Lateef, M. Kas, A. Chahi, and Y. Ruichek, "A two-stream conditional generative adversarial network for improving semantic predictions in urban driving scenes," Eng. Appl. Artif. Intell., vol. 133, Jul. 2024, Art. no. 108290.
[140] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," 2013, arXiv:1312.6114.
[141] L. Gou, L. Zou, N. Li, M. Hofmann, A. K. Shekar, A. Wendt, and L. Ren, "VATLD: A visual analytics system to assess, understand and improve traffic light detection," IEEE Trans. Vis. Comput. Graph., vol. 27, no. 2, pp. 261-271, Feb. 2021.
[142] Z. Chen and L. Liu, "NSS-VAEs: Generative scene decomposition for visual navigable space construction," 2021, arXiv:2111.01127.
[143] V. K. Sundar, S. Ramakrishna, Z. Rahiminasab, A. Easwaran, and A. Dubey, "Out-of-distribution detection in multi-label datasets using latent space of B-VAE," in Proc. IEEE Secur. Privacy Workshops (SPW), May 2020, pp. 250-255.
[144] S. Tan, K. Wong, S. Wang, S. Manivasagam, M. Ren, and R. Urtasun, "SceneGen: Learning to generate realistic traffic scenes," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 892-901.
[145] W. Ding, H. Lin, B. Li, and D. Zhao, "Semantically adversarial scenario generation with explicit knowledge guidance," 2021, arXiv:2106.04066.
[146] N. Aslam and M. H. Kolekar, "A-VAE: Attention based variational autoencoder for traffic video anomaly detection," in Proc. IEEE 8th Int. Conf. Converg. Technol. (I2CT), Apr. 2023, pp. 1-7.
[147] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner, "Understanding disentangling in β-VAE," 2018, arXiv:1804.03599.
[148] Z. Li, C. Zhang, G. Meng, and Y. Liu, "Joint haze image synthesis and dehazing with mmd-vae losses," 2019, arXiv:1905.05947.
[149] Q. Tian and J. Sun, "Cluster-based dual-branch contrastive learning for unsupervised domain adaptation person re-identification," Knowl.-Based Syst., vol. 280, Nov. 2023, Art. no. 111026.
[150] X. Gao, Z. Chen, J. Wei, R. Wang, and Z. Zhao, "Deep mutual distillation for unsupervised domain adaptation person re-identification," IEEE Trans. Multimedia, early access, Sep. 12, 2024, doi: 10.1109/TMM.2024.3459637.
[151] G. Mattolin, L. Zanella, E. Ricci, and Y. Wang, "ConfMix: Unsupervised domain adaptation for object detection via confidence-based mixing," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2023, pp. 423-433.
[152] D. Shenaj, E. Fanì, M. Toldo, D. Caldarola, A. Tavera, U. Michieli, M. Ciccone, P. Zanuttigh, and B. Caputo, "Learning across domains and devices: Style-driven source-free domain adaptation in clustered federated learning," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2023, pp. 444-454.
[153] Y. Zheng, D. Huang, S. Liu, and Y. Wang, "Cross-domain object detection through coarse-to-fine feature adaptation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 13763-13772.
[154] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. E. Hinton, "Big self-supervised models are strong semi-supervised learners," in Proc. Adv. Neural Inf. Process. Syst., Jan. 2020, pp. 22243-22255.
[155] T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton, "A simple framework for contrastive learning of visual representations," in Proc. 37th Int. Conf. Mach. Learn., Jan. 2020, pp. 1597-1607.
[156] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," 2015, arXiv:1503.02531.
[157] S. Kullback and R. A. Leibler, "On information and sufficiency," Ann. Math. Statist., vol. 22, no. 1, pp. 79-86, Mar. 1951.
[158] A. Gretton, K. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola, "A kernel two-sample test," J. Mach. Learn. Res., vol. 13, no. 1, pp. 723-773, Mar. 2012.
[159] Z. Zhao, S. Wei, Q. Chen, D. Li, Y. Yang, Y. Peng, and Y. Liu, "Masked retraining teacher-student framework for domain adaptive object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 18993-19003.
[160] K. Gong, S. Li, S. Li, R. Zhang, C. H. Liu, and Q. Chen, "Improving transferability for domain adaptive detection transformers," in Proc. 30th ACM Int. Conf. Multimedia, Oct. 2022, pp. 1543-1551.
[161] G. Li, Z. Ji, Y. Chang, S. Li, X. Qu, and D. Cao, "ML-ANet: A transfer learning approach using adaptation network for multi-label image classification in autonomous driving," Chin. J. Mech. Eng., vol. 34, no. 1, p. 78, Dec. 2021.
[162] D. Mekhazni, A. Bhuiyan, G. Ekladious, and E. Granger, "Unsupervised domain adaptation in the dissimilarity space for person re-identification," in Computer Vision-ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds., Cham, Switzerland: Springer, 2020, pp. 159-174.
[163] C.-Y. Lee, T. Batra, M. H. Baig, and D. Ulbricht, "Sliced Wasserstein discrepancy for unsupervised domain adaptation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 10277-10287.
[164] A.-D. Doan, B. L. Nguyen, S. Gupta, I. Reid, M. Wagner, and T.-J. Chin, "Assessing domain gap for continual domain adaptation in object detection," Comput. Vis. Image Understand., vol. 238, Jan. 2024, Art. no. 103885.
[165] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," 2016, arXiv:1611.07004.
[166] M.-Y. Liu, T. M. Breuel, and J. Kautz, "Unsupervised image-to-image translation networks," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, Jan. 2017, pp. 700-708.
[167] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim, "Image to image translation for domain adaptation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4500-4509.
[168] J. Lee, D. Shiotsuka, G. Bang, Y. Endo, T. Nishimori, K. Nakao, and S. Kamijo, "Day-to-night image translation via transfer learning to keep semantic information for driving simulator," IATSS Res., vol. 47, no. 2, pp. 251-262, Jul. 2023.
[169] D. Kothandaraman, A. Nambiar, and A. Mittal, "Domain adaptive knowledge distillation for driving scene semantic segmentation," in Proc. IEEE Winter Conf. Appl. Comput. Vis. Workshops (WACVW), Jan. 2021, pp. 134-143.
[170] H. Wang, S. Liao, and L. Shao, "AFAN: Augmented feature alignment network for cross-domain object detection," IEEE Trans. Image Process., vol. 30, pp. 4046-4056, 2021.
[171] J. Li, R. Xu, X. Liu, J. Ma, B. Li, Q. Zou, J. Ma, and H. Yu, "Domain adaptation based object detection for autonomous driving in foggy and rainy weather," 2023, arXiv:2307.09676.
[172] Y. Guo, R. Liang, Y. Cui, X. Zhao, and Q. Meng, "A domain-adaptive method with cycle perceptual consistency adversarial networks for vehicle target detection in foggy weather," IET Intell. Transp. Syst., vol. 16, no. 7, pp. 971-981, Jul. 2022.
[173] X. Yu and X. Lu, "Domain adaptation of anchor-free object detection for urban traffic," Neurocomputing, vol. 582, May 2024, Art. no. 127477.
[174] M. Biasetton, U. Michieli, G. Agresti, and P. Zanuttigh, "Unsupervised domain adaptation for semantic segmentation of urban scenes," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2019, pp. 1211-1220.
[175] M. Saffari, M. Khodayar, and S. M. J. Jalali, "Sparse adversarial unsupervised domain adaptation with deep dictionary learning for traffic scene classification," IEEE Trans. Emerg. Topics Comput. Intell., vol. 7, no. 4, pp. 1139-1150, Apr. 2023.
[176] M. Saffari and M. Khodayar, "Low-rank sparse generative adversarial unsupervised domain adaptation for multitarget traffic scene semantic segmentation," IEEE Trans. Ind. Informat., vol. 20, no. 2, pp. 2564-2576, Feb. 2024.
[177] H. Zhang, G. Luo, J. Li, and F.-Y. Wang, "C2FDA: Coarse-to-fine domain adaptation for traffic object detection," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 8, pp. 12633-12647, Aug. 2022.
[178] Q. Zhou, Q. Gu, J. Pang, X. Lu, and L. Ma, "Self-adversarial disentangling for specific domain adaptation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 7, pp. 8954-8968, Jul. 2023.
[179] J. Wang, T. Shen, Y. Tian, Y. Wang, C. Gou, X. Wang, F. Yao, and C. Sun, "A parallel teacher for synthetic-to-real domain adaptation of traffic object detection," IEEE Trans. Intell. Vehicles, vol. 7, no. 3, pp. 441-455, Sep. 2022.
[180] L. Zhang, P. Ratsamee, B. Wang, Z. Luo, Y. Uranishi, M. Higashida, and H. Takemura, "Panoptic-aware image-to-image translation," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2023, pp. 259-268.
[181] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell, "CyCADA: Cycle-consistent adversarial domain adaptation," in Proc. Int. Conf. Mach. Learn., Jan. 2017, pp. 1989-1998.
[182] G. Bang, J. Lee, Y. Endo, T. Nishimori, K. Nakao, and S. Kamijo, "Semantic and geometric-aware day-to-night image translation network," Sensors, vol. 24, no. 4, p. 1339, Feb. 2024.
[183] T.-D. Truong, N. Le, B. Raj, J. Cothren, and K. Luu, "FREDOM: Fairness domain adaptation approach to semantic scene understanding," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 19988-19997.
[184] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1-9.
[185] A. Cherian and A. Sullivan, "Sem-GAN: Semantically-consistent image-to-image translation," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2019, pp. 1797-1806.
[186] S.-W. Huang, C.-T. Lin, S. Chen, Y.-Y. Wu, P.-H. Hsu, and S. Lai, "Aug-GAN: Cross domain adaptation with GAN-based data augmentation," in Proc. Eur. Conf. Comput. Vis. (ECCV), Jan. 2018, pp. 731-744.
[187] R. Volpi, P. Morerio, S. Savarese, and V. Murino, "Adversarial feature augmentation for unsupervised domain adaptation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 5495-5504.
[188] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," 2017, arXiv:1706.03762.
[189] M. Salem, A. Gomaa, and N. Tsurusaki, "Detection of earthquake-induced building damages using remote sensing data and deep learning: A case study of mashiki town, Japan," in Proc. IEEE Int. Geosci. Remote Sens. Symp., Jul. 2023, pp. 2350-2353.
[190] A. Gomaa, M. M. Abdelwahab, and M. Abo-Zahhad, "Real-time algorithm for simultaneous vehicle detection and tracking in aerial view videos," in Proc. IEEE 61st Int. Midwest Symp. Circuits Syst. (MWSCAS), Aug. 2018, pp. 222-225.
[191] M. A. Khan and H. Park, "Exploring explainable artificial intelligence techniques for interpretable neural networks in traffic sign recognition systems," Electronics, vol. 13, no. 2, p. 306, Jan. 2024.
[192] C. Bustos, D. Rhoads, A. Solé-Ribalta, D. Masip, A. Arenas, A. Lapedriza, and J. Borge-Holthoefer, "Explainable, automated urban interventions to improve pedestrian and vehicle safety," Transp. Res. C, Emerg. Technol., vol. 125, Apr. 2021, Art. no. 103018.
[193] S. Kolekar, S. Gite, B. Pradhan, and A. Alamri, "Explainable AI in scene understanding for autonomous vehicles in unstructured traffic environments on Indian roads using the inception U-Net model with grad-CAM visualization," Sensors, vol. 22, no. 24, p. 9677, Dec. 2022.
[194] J. Dong, S. Chen, M. Miralinaghi, T. Chen, P. Li, and S. Labi, "Why did the AI make that decision? Towards an explainable artificial intelligence (XAI) for autonomous driving systems," Transp. Res. C, Emerg. Technol., vol. 156, Nov. 2023, Art. no. 104358.
[195] K. Han, Y. Wang, J. Guo, Y. Tang, and E. Wu, "Vision GNN: An image is worth graph of nodes," in Proc. Adv. Neural Inf. Process. Syst., A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, Eds., Jan. 2022, pp. 8291-8303.
[196] J. Regan and M. Khodayar, "A triplet graph convolutional network with attention and similarity-driven dictionary learning for remote sensing image retrieval," Expert Syst. Appl., vol. 232, Dec. 2023, Art. no. 120579.
[197] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, Z. Yang, Y. Zhang, and D. Tao, "A survey on vision transformer," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, pp. 87-110, Jan. 2023.
[198] G. Li, M. Müller, A. Thabet, and B. Ghanem, "DeepGCNs: Can GCNs Go As Deep As CNNs?" in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2019, pp. 9266-9275.
[199] T. K. Rusch, M. M. Bronstein, and S. Mishra, "A survey on oversmoothing in graph neural networks," 2023, arXiv:2303.10993.
[200] J. Li, Q. Zhang, W. Liu, A. B. Chan, and Y.-G. Fu, "Another perspective of over-smoothing: Alleviating semantic over-smoothing in deep GNNs," IEEE Trans. Neural Netw. Learn. Syst., early access, May 29, 2024, doi: 10.1109/TNNLS.2024.3402317.
[201] L. J. Zhang, J. J. Fang, Y. X. Liu, H. Feng Le, Z. Q. Rao, and J. X. Zhao, "CR-YOLOv8: Multiscale object detection in traffic sign images," IEEE Access, vol. 12, pp. 219-228, 2024.
[202] S. R. Dubey and S. K. Singh, "Transformer-based generative adversarial networks in computer vision: A comprehensive survey," IEEE Trans. Artif. Intell., vol. 5, no. 10, pp. 4851-4867, Oct. 2024.
[203] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, "The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes," in Proc. IEEE Conf. Com-put. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 3234-3243.
[204] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, "Playing for data: Ground truth from computer games," in Proc. Eur. Conf. Comput. Vis., in Lecture Notes in Computer Science: Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, vol. 9906, Jan. 2016, pp. 102-118.
[205] R. Zhang, K. Xiong, H. Du, D. Niyato, J. Kang, X. Shen, and H. V. Poor, "Generative AI-enabled vehicular networks: Fundamentals, framework, and case study," IEEE Netw., vol. 38, no. 4, pp. 259-267, Jul. 2024.
[206] E. Galazka, T. T. Niemirepo, and J. Vanne, "CiThruS2: Open-source photorealistic 3D framework for driving and traffic simulation in real time," in Proc. IEEE Int. Intell. Transp. Syst. Conf. (ITSC), Sep. 2021, pp. 3284-3291.
[207] X. Li, J. Park, C. Reberg-Horton, S. Mirsky, E. Lobaton, and L. Xiang, "Photorealistic arm robot simulation for 3D plant reconstruction and automatic annotation using unreal engine 5," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2024, pp. 5480-5488.
[208] E. Yurtsever, D. Yang, I. M. Koc, and K. A. Redmill, "Photorealism in driving simulations: Blending generative adversarial image synthesis with rendering," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 12, pp. 23114-23123, Dec. 2022.
[209] S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, "A survey on multimodal large language models," 2023, arXiv:2306.13549.
[210] H. Wang, J. Qin, A. Bastola, X. Chen, J. Suchanek, Z. Gong, and A. Razi, "VisionGPT: LLM-assisted real-time anomaly detection for safe visual navigation," 2024, arXiv:2403.12415.
[211] T.-A. To, M.-N. Tran, T.-B. Ho, T.-L. Ha, Q.-T. Nguyen, H.-C. Luong, T.-D. Cao, and M.-T. Tran, "Multi-perspective traffic video description model with fine-grained refinement approach," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2024, pp. 7075-7084.
[212] C. Cui, Y. Ma, X. Cao, W. Ye, and Z. Wang, "Receive, reason, and react: Drive as you say, with large language models in autonomous vehicles," IEEE Intell. Transp. Syst. Mag., vol. 16, no. 4, pp. 81-94, Jul. 2024.
[213] L. Kong, X. Xu, J. Ren, W. Zhang, L. Pan, K. Chen, W. T. Ooi, and Z. Liu, "Multi-modal data-efficient 3D scene understanding for autonomous driving," 2024, arXiv:2405.05258.
[214] K. Dasgupta, A. Das, S. Das, U. Bhattacharya, and S. Yogamani, "Spatio-contextual deep network-based multimodal pedestrian detection for autonomous driving," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 9, pp. 15940-15950, Sep. 2022.
[215] T. Qian, J. Chen, L. Zhuo, Y. Jiao, and Y. Jiang, "NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario," in Proc. AAAI Conf. Artif. Intell., 2024, vol. 38, no. 5, pp. 4542-4550.
[216] J. Li, Y. Zhang, P. Yun, G. Zhou, Q. Chen, and R. Fan, "RoadFormer: Duplex transformer for RGB-normal semantic road scene parsing," IEEE Trans. Intell. Vehicles, vol. 9, no. 7, pp. 5163-5172, Jul. 2024.
[217] D. Feng, C. Haase-Schütz, L. Rosenbaum, H. Hertlein, C. Gläser, F. Timm, W. Wiesbeck, and K. Dietmayer, "Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges," IEEE Trans. Intell. Transp. Syst., vol. 22, no. 3, pp. 1341-1360, Mar. 2021.

PARYA DOLATYABI (Graduate Student Member, IEEE) received the B.Sc. degree in computer science from the Shahid Bahonar University of Kerman, Kerman, Iran, in 2003, the M.Sc. degree in information technology engineering from the K. N. Toosi University of Technology, Tehran, Iran, in 2007, and the M.Sc. degree in computer engineering from the University of Tulsa (TU), Tulsa, OK, USA, in 2024, where she is currently pursuing the Ph.D. degree in computer science. In 2023, she completed an internship as a Research Assistant with the Laureate Institute for Brain Research (LIBR), Tulsa. Her primary research interests include the theories and applications of deep learning models in computer vision and computational neuroscience. Additionally, she serves as a Reviewer for IEEE TRANSACTIONS ON TRANSPORTATION ELECTRIFICATION and for Sustainable Computing: Informatics and Systems.

JACOB REGAN (Graduate Student Member, IEEE) received the B.Sc. and M.Sc. degrees in computer science from the University of Tulsa (TU), Tulsa, OK, USA, in 2021 and 2022, respectively, where he is currently pursuing the Ph.D. degree in computer science. His main research interests include artificial intelligence, machine learning, computer vision, and transportation network simulation and optimization.

MAHDI KHODAYAR (Member, IEEE) received the B.Sc. degree in computer engineering and the M.Sc. degree in artificial intelligence from the K. N. Toosi University of Technology, Tehran, Iran, in 2013 and 2015, respectively, and the Ph.D. degree in electrical engineering from Southern Methodist University, Dallas, TX, USA, in 2020. In 2017, he was a Research Assistant with the College of Computer and Information Science, Northeastern University, Boston, MA, USA. He is currently an Assistant Professor with the Department of Computer Science, The University of Tulsa, Tulsa, OK, USA. His main research interests include machine learning and statistical pattern recognition. He is focused on DL, sparse modeling, and spatiotemporal pattern recognition. He has served as a Reviewer for many reputable journals, including IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, IEEE TRANSACTIONS ON FUZZY SYSTEMS, IEEE TRANSACTIONS ON SUSTAINABLE ENERGY, and IEEE TRANSACTIONS ON POWER SYSTEMS. Additionally, he serves as an Editor for IEEE TRANSACTIONS ON TRANSPORTATION ELECTRIFICATION.